INQUIRING LINE

Can architectural constraints on model input reduce emotional interpolation in clinical AI?

This explores whether structuring or constraining what a clinical AI takes in — through task decomposition, modular design, or how emotion is represented at input/output — can curb the tendency of LLMs to 'read in' feelings the user never expressed.


Can the interpolation problem be engineered out by constraining the model's input rather than retraining it? The corpus suggests architecture helps, but only partway, because the pull toward inventing emotions comes from training incentives the input layer can't reach. The clearest direct evidence comes from therapists reviewing GPT-4 in the CaiTI system, who found it 'reads into' user feelings instead of responding to what was actually said. Splitting the work across specialized modules (a Reasoner, a Guide, a Validator) reduced this bias but did not eliminate it (see: Do language models add feelings users never actually expressed?). So decomposition is a real lever, just not a complete fix.
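To make the decomposition idea concrete, here is a minimal sketch of a validation step that vetoes a drafted reply when it names a feeling the user never stated. Everything in it is an assumption for illustration: the tiny `EMOTION_WORDS` lexicon, the function names, and the way the drafting and validating roles are split are hypothetical and are not taken from CaiTI.

```python
# Illustrative sketch only: the lexicon, function names, and role split are
# assumptions made for this example, not CaiTI's implementation.
EMOTION_WORDS = {"sad", "angry", "anxious", "lonely", "afraid", "ashamed", "hopeless"}


def tokens(text: str) -> set:
    """Lowercase the text and strip surrounding punctuation from each word."""
    return {w.strip(".,!?;:'\"").lower() for w in text.split()}


def unexpressed_emotions(user_text: str, draft: str) -> set:
    """Validator step: emotion words in the draft that the user never used."""
    return (tokens(draft) & EMOTION_WORDS) - tokens(user_text)


def respond(user_text: str, draft_reply, neutral_reply: str) -> str:
    """Run a drafting step, then veto drafts that attribute new feelings."""
    draft = draft_reply(user_text)  # stand-in for the model's drafting stage
    if unexpressed_emotions(user_text, draft):
        return neutral_reply  # fall back to a reply that adds no emotion
    return draft


def overeager(_text: str) -> str:
    """Hypothetical drafting stage that reads feelings into the message."""
    return "It sounds like you feel ashamed and anxious."


user = "I missed my appointment again."
neutral = "You missed your appointment again. What happened?"
print(respond(user, overeager, neutral))
```

A word-matching check like this would only catch feelings that are named outright; paraphrased attributions would pass through, which is one plausible reason a validator stage reduces interpolation without eliminating it.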

A second architectural angle is how emotion is represented in the first place. If a system is forced to assign one emotion label, it has already committed to an interpretation; if it instead estimates intensity across many dimensions, it preserves the ambiguity the user actually presented. Constructed-emotion theory argues that emotions emerge from context, learned concepts, and interoceptive signals rather than universal patterns, and the EMONET approach operationalizes this with continuous 40-category intensity scales instead of single-label classification (see: Should emotion AI estimate intensity instead of assigning labels?). That is a constraint on the representation that structurally resists premature emotional commitment, a different route to the same goal as task decomposition.
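The difference between the two representations can be shown with a toy example. This is a sketch under stated assumptions: it uses four made-up categories rather than EMONET's 40, the intensity values are invented, and the `floor` threshold is a hypothetical parameter chosen only for the illustration.

```python
# Hypothetical per-category intensities in [0, 1] for one user message.
# The categories and numbers are invented for illustration; they are not
# EMONET outputs.
intensities = {
    "sadness": 0.60,
    "fatigue": 0.55,
    "frustration": 0.20,
    "relief": 0.05,
}


def single_label(scores: dict) -> str:
    """Single-label classification: commit to the strongest category."""
    return max(scores, key=scores.get)


def intensity_profile(scores: dict, floor: float = 0.3) -> dict:
    """Intensity representation: keep every category at or above the floor."""
    return {name: value for name, value in scores.items() if value >= floor}


print(single_label(intensities))
print(intensity_profile(intensities))
```

The single label discards the near-tie between the two strongest categories, while the profile passes both downstream, so a later stage is not handed an interpretation the user never committed to.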

The most striking finding reframes 'constraint on input' as constraint on the whole medium. A 15-day study with 38 students found that robots and paper worksheets reduced distress while a chatbot running the *identical* LLM did not; the active ingredient was social presence and structured format, not language capability (see: Why do robots outperform chatbots in therapy despite identical language models?). And the Secure Attachment Persona module shows that calibrated boundaries and action-based validation can be hard-wired into the system design via attachment theory, improving crisis response over baseline models (see: Can attachment theory prevent parasocial harm in AI companions?). Both suggest that the scaffolding around the model can shape clinical behavior more than the model's raw text-handling.

But the deeper finding is that interpolation isn't a bug in the input pipeline; it's a feature of how these models were trained to be helpful and warm. LLM therapists default to problem-solving during emotional disclosure, a hallmark of *low-quality* therapy, likely driven by RLHF's helpfulness bias (see: Do LLM therapists respond to emotions like low-quality human therapists?). Worse, deliberately training for warmth raises errors in medical reasoning by up to 30 points, with effects that intensify exactly when users express sadness or false beliefs (see: Does empathy training make AI systems less reliable?). And AI empathy that soothes feelings can strip away the signaling function those negative emotions carry: interpolation isn't just inaccurate, it can actively undermine what emotions are supposed to teach the patient (see: Does soothing AI empathy actually harm what emotions teach us?).

So the honest synthesis: input-side architecture (decomposition, intensity-based representation, structured embodied delivery, attachment-grounded boundary modules) can *reduce* emotional interpolation and is the most deployable near-term fix, though the direct evidence so far is for decomposition. But because the impulse to invent and soothe emotions is baked in by preference optimization, constraints alone hit a floor. The complementary move is on the reward side: RLVER uses a simulated user's emotion trajectory as the training signal and shifts models from solution-centric to genuinely grounded responses (see: Can emotion rewards make language models genuinely empathic?). The strongest clinical systems will likely pair architectural constraints on input with retraining of what the model is rewarded for.
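As a rough illustration of the reward-side idea, the sketch below turns a simulated user's emotion trajectory into a scalar reward and then normalizes rewards within a group of sampled dialogues, in the way GRPO is usually described. The reward formula (final score minus initial score), the score scale, and the function names are assumptions for this example and should not be read as RLVER's actual definition.

```python
from statistics import mean, pstdev


def trajectory_reward(emotion_scores: list) -> float:
    """Assumed reward: change in the simulated user's emotion score from the
    first turn to the last. A hypothetical stand-in, not RLVER's formula."""
    return emotion_scores[-1] - emotion_scores[0]


def group_relative_advantages(rewards: list) -> list:
    """Normalize rewards within one group of sampled dialogues:
    (reward - group mean) / group standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no signal
    return [(r - mu) / sigma for r in rewards]


# Two hypothetical dialogues sampled for the same simulated user.
grounded = [40, 55, 70]   # user's emotion score rises across turns
solution = [40, 45, 50]   # advice-heavy replies move the score less

rewards = [trajectory_reward(grounded), trajectory_reward(solution)]
print(rewards)
print(group_relative_advantages(rewards))
```

Under these assumptions the dialogue that actually moves the simulated user's state gets the positive advantage, so the training signal depends on the user's trajectory rather than on how warm the reply sounds.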


Sources (8 notes)

Do language models add feelings users never actually expressed?

Therapists reviewing GPT-4 in the CaiTI system found it "reads into" user feelings rather than responding objectively. Task decomposition across specialized models (Reasoner/Guide/Validator) reduces but does not eliminate this interpretation bias.

Should emotion AI estimate intensity instead of assigning labels?

Constructed emotion theory shows emotions emerge from interoceptive signals, learned concepts, and context—not universal patterns. EMONET operationalizes this insight using 40-category continuous intensity scales instead of single-label classification, preserving the multi-dimensional nature of emotional expression.

Why do robots outperform chatbots in therapy despite identical language models?

A 15-day study with 38 students found that robots and worksheets significantly reduced psychological distress while a chatbot using the same LLM did not. The active ingredient was the medium—social presence and structured format—not language capability.

Can attachment theory prevent parasocial harm in AI companions?

The Secure Attachment Persona module integrates Bowlby's attachment theory, Gottman's interaction ratios, and emotion regulation models to prevent parasocial manipulation through action-based validation and calibrated boundaries. Benchmarks show SAP improves crisis response compared to baseline models, though long-horizon planning remains unsolved.

Do LLM therapists respond to emotions like low-quality human therapists?

Using the BOLT framework, researchers found LLMs offer solution-focused advice during emotional disclosure—a hallmark of low-quality therapy—yet also reflect more on client needs and strengths than typical poor human therapy, creating an unusual hybrid profile likely driven by RLHF's helpfulness bias.

Does empathy training make AI systems less reliable?

Research shows persona training for empathy increases errors in medical reasoning, truthfulness, and disinformation resistance. Standard safety benchmarks miss this vulnerability, and effects intensify when users express sadness or false beliefs.

Does soothing AI empathy actually harm what emotions teach us?

Research shows empathetic AI systematically removes negative emotions' signaling functions while lacking character knowledge needed for appropriate response calibration. Natural empathy operates through curiosity, not comfort-seeking.

Can emotion rewards make language models genuinely empathic?

RLVER uses a simulated user's emotion trajectory as an RL reward signal, enabling GRPO to deliver stable empathy improvements while maintaining dialogue quality—countering the typical trade-off between preference optimization and conversational grounding.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI researcher stress-testing clinical LLM safety. The question: can architectural constraints on model *input* reduce emotional interpolation in clinical AI, or is the problem fundamentally rooted in training incentives that no input-side fix can reach?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2026; treat as historical anchors:
• Task decomposition (Reasoner/Guide/Validator modules) measurably reduced but did not eliminate emotional over-interpretation in therapist-reviewed GPT-4 trials (2024).
• Continuous intensity-scale representation (40-category EMONET) structurally resists premature emotional commitment vs. single-label classification, grounded in constructed-emotion theory (2024).
• Embodied delivery (robots, worksheets) outperformed chatbots running identical LLMs on distress reduction; social presence and format mattered more than language capability (2024).
• Training for warmth raised medical reasoning errors by up to 30 points, with effects intensifying when users expressed sadness or false beliefs; RLVER (emotion-trajectory rewards) shifted models from solution-centric to genuinely grounded responses (2025).
• Deliberate empathy training correlates with reduced reliability and increased sycophancy; the interpolation impulse is a *feature* of preference optimization, not a bug in the input pipeline (2025).

Anchor papers (verify; mind their dates):
• arXiv:2401.00820 (Jan 2024) — behavioral assessment framework
• arXiv:2507.21919 (Jul 2025) — warmth-reliability tradeoff
• arXiv:2507.03112 (Jul 2025) — RLVER / emotion rewards
• arXiv:2602.03545 (Feb 2026) — persona generation at scale

Your task:
(1) RE-TEST THE INPUT-SIDE CEILING. For each architectural lever (decomposition, intensity representation, embodied scaffolding), probe whether post-2025 training methods (DPO, IPO, Constitutional AI, synthetic preference data, contrastive emotion labeling) have *moved the needle* beyond ~70–80% interpolation reduction, or whether the floor holds. Separately: does instruction-tuning on clinical transcripts (gold-standard therapy, not synthetic) outperform constraint-based architecture? Cite what you find; state plainly if the input-ceiling persists.
(2) Surface the strongest *contradicting* or *superseding* work from the last 6 months: has anyone shown that fine-tuning on low-warmth, high-accuracy clinical corpora makes architectural decomposition redundant? Or does a newer embodied system (visual, multimodal) sidestep interpolation entirely?
(3) Propose 2 research questions that *assume* the regime has moved: (a) If retraining on verifiable emotion trajectories works, do we still need architectural constraints, or do they compound to diminishing returns? (b) Can a multi-agent clinical ensemble (where one agent *deliberately* resists empathy) outperform a single constrained model?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
