INQUIRING LINE

Why does fixing harm require stakeholder input rather than universal developer definitions?

This explores why developers can't just write one fixed, universal definition of 'harm' and patch it in — and why deciding what counts as harm requires asking the people actually affected.


The short version from the corpus: harm isn't a fixed property of a system. It looks different depending on who you ask, so a definition written from one vantage point quietly bakes in one group's values while presenting them as neutral defaults. "Can human-centered LLM design ever achieve universal solutions?" makes this the central point: what counts as harm depends on stakeholder identity, and the moment a developer settles on "the" definition, they've made an implicit value choice rather than an explicit, revisable one. High-level guidelines feel universal precisely because they're vague enough to hide the contested decisions underneath.

What makes this more than a fairness slogan is where those hidden choices end up living. "Can language models balance competing ethical norms in context?" shows that a model's refusals and tone aren't context-aware negotiations; they're structural defaults frozen at training time, reflecting whoever set the corporate values. So a 'universal' harm definition doesn't stay abstract; it hardens into behavior the end user can't renegotiate in the moment. That's the gap stakeholder input is meant to close: turning a frozen default back into a contestable, situated decision.

The corpus also explains *why patching harm late doesn't rescue you*. "When should human values enter the LLM development pipeline?" argues that human-centered objectives fail when treated as downstream alignment fixes: values introduced only at post-training can't undo harms already baked into data sourcing and training objectives. If harm is defined universally and applied at the end, you've not only chosen badly, you've chosen too late to fix it. Stakeholders need to be in the room early, at the data and objective stage, not handed a finished system to complain about.

There's a subtler thread worth pulling: even a well-defined harm rule can fail because harm is partly relational, not intrinsic. "What if XAI is fundamentally a communication problem?" makes this case for explanations: their value lies not in the artifact but in the source-framing-recipient triad, the rhetorical situation. The same logic applies to harm: whether something lands as harmful depends on who's receiving it and in what context, which a developer working alone simply cannot see. And "Can ethically aligned AI systems still communicate poorly?" adds that a model can be honest and harmless by the developer's checklist while still violating the norms that matter to the person it's talking to, which is evidence that the developer's definition and the stakeholder's experience are genuinely separate things.

The payoff for a curious reader: 'fixing harm' sounds like an engineering problem with a right answer waiting to be found. The corpus reframes it as a problem of *whose perspective gets encoded as default*. That's not a question better algorithms resolve; it's one only the affected parties can answer, and only if they're brought in before the values are frozen into the pipeline.


Sources (5 notes)

Can human-centered LLM design ever achieve universal solutions?

Research shows that optimal LLM design paths depend on stakeholder identity and how contested concepts like harm are operationalized. High-level guidelines fail to capture real-world nuance, leaving developers to make implicit value choices rather than explicit, revisable ones.

Can language models balance competing ethical norms in context?

LLMs cannot perform the situated trade-offs that human pragmatic competence requires. Their ethical principles are structural defaults set at training time, not negotiable moves adapted to context, creating a gap between ethical adherence and communicative appropriateness.

When should human values enter the LLM development pipeline?

The HCLLM framework argues that human-centered objectives fail when treated as downstream alignment patches. Values introduced only at post-training cannot recover harms baked into data sourcing or training objectives, so embedding human priorities at every stage—data, training, evaluation, deployment—is architecturally necessary.

What if XAI is fundamentally a communication problem?

Explanation quality is not intrinsic to the explanation itself but depends on the rhetorical situation: who presents it, how it is framed, and what role the recipient plays. Evaluations that ignore this triad measure only a narrow slice of real-world effectiveness.

Can ethically aligned AI systems still communicate poorly?

Research shows that HHH-aligned models can violate Gricean maxims, lose common ground, and mishandle context despite being honest and harmless. Pragmatic competence requires architectural changes that RLHF alone cannot deliver.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are an AI alignment researcher re-testing whether harm definitions must remain stakeholder-negotiated or whether newer model capabilities, training methods, or evaluation harnesses have since enabled developers to encode context-aware, stakeholder-responsive harm definitions without explicit multi-party input at each stage.

What a curated library found — and when (dated claims, not current truth):

Findings span June 2024–May 2026. A curated library identified:
- Harm is relational, not intrinsic: its valence depends on stakeholder identity and rhetorical context, not a property of the system alone (2025-05).
- Developer-chosen universal definitions freeze into structural defaults (refusal patterns, tone) at training time, opaque to end-user renegotiation (2024-10).
- Post-training alignment patches cannot undo harms baked into data sourcing and training objectives; stakeholder input must occur early in the pipeline (2024-06).
- Models can satisfy developer checklists (honest, harmless) while violating norms that matter to affected users—evidence the developer's definition and stakeholder experience diverge (2025-05).
- Recent work on partner-aware collaboration and agentic misalignment suggests models *can* reason about multi-party constraints, but whether this scales to real stakeholder disagreement remains unclear (2025-10).

Anchor papers (verify; mind their dates):
- arXiv:2406.09264 (Jun 2024): Bidirectional Human-AI Alignment
- arXiv:2410.18417 (Oct 2024): LLMs Reflect Creators' Ideology
- arXiv:2502.08640 (Feb 2025): Utility Engineering & Emergent Value Systems
- arXiv:2605.06901 (May 2026): Human-Centered LLMs—Reflections & New Directions

Your task:

(1) RE-TEST EACH CONSTRAINT. For each finding above, probe whether advances in (a) in-context learning & prompt-based value steering, (b) multi-objective training or constitutional AI refinements, (c) real-time stakeholder feedback loops (e.g., collaborative fine-tuning, user-in-the-loop steering), or (d) evaluation harnesses that measure multi-party norm compliance have since relaxed the need for *early* stakeholder input. Separate the durable claim (stakeholder disagreement is real and persistent) from the perishable limitation (only early pipeline inclusion can surface it). Cite what relaxed it; flag where the constraint still holds.

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from the last ~6 months—especially papers on self-improving model steering (2025-07), partner-aware collaboration (2025-10), or production-grade agentic design (2025-12) that show developers *can* encode adaptive, context-sensitive harm negotiation without freezing values upfront.

(3) Propose 2 research questions that assume the regime may have shifted:
   — Can multi-agent debate or constitutional reasoning at inference time substitute for early stakeholder input by making value conflicts explicit and contestable *in real time*?
   — If models can now learn "partner-aware" harm definitions, what does stakeholder input do that self-improving steering cannot?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
