INQUIRING LINE


The AI training trick behind better reasoning models works for math and code — but what about domains with no right answer?

Can preference trees structure alignment data for domains beyond math and code?

This explores whether the preference-tree data structure — branching trees of reasoning chains, critiques, and correct/incorrect pairs used to align reasoning models — can carry over into domains where there's no clean right answer, like writing or open-ended judgment.

The corpus suggests the format's power is tied to exactly what math and code provide for free: a verifiable correctness signal. The original result (see "What alignment data structure best trains reasoning generalists?") built state-of-the-art open-model reasoning by organizing each instruction as a tree of diverse planning strategies, critique trajectories, and pairwise comparisons. What makes that tree trainable is that every branch can be scored as correct or incorrect. The same logic shows up in function calling (see "Can small models match large models on function calling?"), where DPO's explicit negative examples beat plain supervised fine-tuning precisely because there is an objective format to be right or wrong about. Trees thrive wherever you can mechanically tell good branches from bad ones.
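To make the "every branch can be scored" point concrete, here is a minimal sketch of a tree whose verified sibling branches are turned into (chosen, rejected) pairs. The `Node` class, its fields, and `sibling_pairs` are hypothetical names invented for illustration; this is not the ULTRAINTERACT schema, only one plausible shape for it.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional, Tuple


@dataclass
class Node:
    """One attempt or critique in a hypothetical preference tree."""
    text: str
    correct: Optional[bool] = None          # None means "not verified"
    children: List["Node"] = field(default_factory=list)


def sibling_pairs(node: Node) -> List[Tuple[str, str]]:
    """Collect (chosen, rejected) pairs from verified siblings at every level."""
    good = [c for c in node.children if c.correct is True]
    bad = [c for c in node.children if c.correct is False]
    pairs = [(g.text, b.text) for g, b in product(good, bad)]
    for child in node.children:
        pairs.extend(sibling_pairs(child))
    return pairs


# Toy instruction with a checkable answer (2 + 2 * 3 = 8).
root = Node("Compute 2 + 2 * 3", children=[
    Node("Multiply first: 2 * 3 = 6, then 2 + 6 = 8", correct=True),
    Node("Add first: 2 + 2 = 4, then 4 * 3 = 12", correct=False, children=[
        Node("Critique: precedence was ignored; redo as 2 + 6 = 8", correct=True),
        Node("Critique: the arithmetic is fine; keep 12", correct=False),
    ]),
    Node("Guess: probably 10"),             # unverified, so never paired
])

pairs = sibling_pairs(root)
for chosen, rejected in pairs:
    print("chosen:", chosen, "| rejected:", rejected)
```

The unverified branch contributes nothing, which is the whole argument in miniature: without a `correct` value there is no pair to train on.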

Move into subjective domains and that scoring step quietly breaks. In AI writing assistance (see "Can user preference guide AI writing tool alignment?"), writers preferred AI rewrites 63% of the time yet objected to the persona distortions baked into those same rewrites; polish and distortion turned out to be entangled at the model level. A preference tree built on that signal wouldn't just fail to help; it would faithfully encode the distortion as the 'winning' branch. The problem runs deeper than any one domain: annotation responses themselves (see "Do all annotation responses measure the same underlying thing?") decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and treating them uniformly contaminates the reward signal. In math, a label is a fact; in writing, a 'preference' label may be three different things wearing the same coat.
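The source note for the annotation point says the three response types are distinguishable by consistency across measurement conditions. A rough sketch of what using that could look like: ask the same comparison several times (the conditions, `min_repeats`, and `min_agreement` are all assumptions chosen for illustration) and keep only labels that hold steady. Note that this only separates stable from unstable responses; it does not by itself tell non-attitudes apart from constructed preferences.

```python
from collections import Counter, defaultdict


def stable_pairs(responses, min_repeats=3, min_agreement=1.0):
    """Keep comparisons whose label is consistent across repeated asks.

    responses: iterable of (pair_id, choice), with choice in {"A", "B"}.
    Returns {pair_id: winning_choice} for the pairs that pass the filter.
    """
    by_pair = defaultdict(list)
    for pair_id, choice in responses:
        by_pair[pair_id].append(choice)

    kept = {}
    for pair_id, choices in by_pair.items():
        if len(choices) < min_repeats:
            continue                        # too few repeats to judge stability
        winner, count = Counter(choices).most_common(1)[0]
        if count / len(choices) >= min_agreement:
            kept[pair_id] = winner
    return kept


# Toy data: p1 is stable, p2 flips between asks, p3 was asked only once.
responses = [
    ("p1", "A"), ("p1", "A"), ("p1", "A"),
    ("p2", "A"), ("p2", "B"), ("p2", "A"),
    ("p3", "B"),
]
kept = stable_pairs(responses)
print(kept)
```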

So the honest answer is: the tree *structure* transfers fine, but the *labels that fill it* don't. The bottleneck isn't the data format; it's whether the domain hands you a trustworthy comparison. That reframes the real question as where reliable pairwise judgments come from. One route is scale: crowdsourced pairwise voting (see "Can crowdsourced votes reliably rank language models?") produces credible rankings on diverse open-ended prompts when the questions are discriminating enough and crowd judgments correlate with expert raters, which suggests some non-verifiable domains can still yield clean preference pairs at volume.
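For readers wondering how a pile of pairwise votes becomes a ranking, one standard model is Bradley-Terry, fitted below with a simple iterative update. This is a minimal sketch on made-up votes; the function name and data are hypothetical, and it is not claimed to reproduce the exact procedure behind the rankings cited above. It assumes each vote is between two different models and that every model has at least one win.

```python
from collections import defaultdict


def bradley_terry(votes, iters=200):
    """Fit Bradley-Terry strengths from (winner, loser) votes.

    Returns a dict of strengths normalized to sum to 1.
    """
    wins = defaultdict(float)
    games = defaultdict(float)              # comparisons per unordered pair
    models = set()
    for winner, loser in votes:
        wins[winner] += 1
        games[frozenset((winner, loser))] += 1
        models.update((winner, loser))

    strength = {m: 1.0 / len(models) for m in models}
    for _ in range(iters):
        new = {}
        for m in models:
            # Sum over opponents o of n(m, o) / (p_m + p_o).
            denom = sum(
                n / (strength[m] + strength[o])
                for pair, n in games.items() if m in pair
                for o in pair - {m}
            )
            new[m] = wins[m] / denom
        total = sum(new.values())
        strength = {m: s / total for m, s in new.items()}
    return strength


# Toy votes: A beats B 3-1, B beats C 3-1, A beats C 2-1.
votes = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 2 + [("C", "A")]
)
strength = bradley_terry(votes)
ranking = sorted(strength, key=strength.get, reverse=True)
print(ranking)
```

The fit only says which model wins more often; whether those wins reflect a trustworthy preference is the separate question this section is about.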

The more interesting move is to abandon preference labels entirely. SAMI (see "Can models learn behavioral principles without preference labels?") aligns models to written principles by maximizing the mutual information between a constitution and the response: no preference pairs, no demonstrations, and a weaker model could even author principles to align a stronger one. For domains where 'better' is contested, structuring alignment data around *principles* rather than *winners* may be the version of a tree that survives the trip out of math and code. And it pairs naturally with the curation lesson (see "Can careful curation replace massive alignment datasets?"): if post-training mostly activates capabilities the model already has, a small, carefully built tree of principle-grounded examples may beat a massive tree of noisy preference labels in any domain where the labels can't be trusted.
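To give "maximizing mutual information" some shape: a common way to push up a lower bound on mutual information is an InfoNCE-style contrastive loss. The sketch below assumes a square matrix `logp[i][j]` holding the log-probability of response j under constitution i, with matched pairs on the diagonal. Both that setup and the function names are assumptions for illustration; whether this matches SAMI's exact objective is not established by the notes here.

```python
import math


def log_softmax(xs):
    """Numerically stable log-softmax over a list of floats."""
    m = max(xs)
    lse = m + math.log(sum(math.exp(x - m) for x in xs))
    return [x - lse for x in xs]


def symmetric_contrastive_loss(logp):
    """InfoNCE-style loss over a square log-probability matrix.

    Lower loss means each response is more identifiable from its own
    constitution (rows) and each constitution from its own response (columns).
    """
    n = len(logp)
    rows = [log_softmax(row) for row in logp]
    cols = [log_softmax([logp[i][j] for i in range(n)]) for j in range(n)]
    row_term = -sum(rows[i][i] for i in range(n)) / n
    col_term = -sum(cols[j][j] for j in range(n)) / n
    return 0.5 * (row_term + col_term)


# Toy matrices: responses that track their constitution vs. ones that ignore it.
matched = [[-1.0, -9.0], [-9.0, -1.0]]
uninformative = [[-5.0, -5.0], [-5.0, -5.0]]

loss_matched = symmetric_contrastive_loss(matched)
loss_uninformative = symmetric_contrastive_loss(uninformative)
print(loss_matched, loss_uninformative)
```

Nothing in this loss names a winner; it only rewards responses that are distinguishable by the principles they were written under, which is why no preference pairs are needed.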

The thing you might not have known you wanted to know: preference trees aren't really a data structure for *preferences* — they're a data structure for *verifiable disagreement*. Where you can't verify, the question stops being 'how do we shape the tree' and becomes 'what do we hang on it instead of preference.'

Sources 7 notes

What alignment data structure best trains reasoning generalists?

Eurus achieved state-of-the-art open-model reasoning by training on ULTRAINTERACT, an alignment dataset structured as preference trees per instruction. The tree format unified diverse planning strategies, interaction-and-critique trajectories, and pairwise data for both SFT and preference learning.

Can small models match large models on function calling?

Small models fine-tuned via DPO on correct and incorrect function-calling examples from a large teacher model achieve high accuracy on logical and mathematical tasks. DPO's explicit negative examples directly target the rigid output format failures where SFT alone underperforms.

Can user preference guide AI writing tool alignment?

Writers prefer AI rewrites 63% of the time but object to systematic persona distortions those same rewrites introduce. Mitigation studies show polish and distortion are entangled at the model level—preference optimization produces both simultaneously.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Can crowdsourced votes reliably rank language models?

Chatbot Arena's 240K+ crowdsourced preference votes produce credible model rankings because the underlying questions are diverse and discriminating, and crowd judgments correlate with expert raters—validating human preference as a scalable evaluation signal.


Can models learn behavioral principles without preference labels?

SAMI fine-tunes language models to increase the mutual information between constitutions and responses, without preference labels or demonstrations. A Mistral-7B model trained this way outperformed base and instruction-tuned baselines, and, surprisingly, a weaker model could write principles to align a stronger one.

Can careful curation replace massive alignment datasets?

LIMA demonstrates that fine-tuning a strong pretrained model on 1,000 carefully curated examples achieves alignment performance competitive with models trained on orders of magnitude more data, suggesting that post-training activates existing capabilities rather than building new ones.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst re-testing whether preference trees can structure alignment data beyond math and code. The question remains open: does the *structure* transfer even when the *labels* don't?

What a curated library found — and when (dated claims, not current truth):

Findings span 2021–2026; treat these as perishable constraints to be re-examined.

• Preference trees achieve state-of-the-art open reasoning in math/code by organizing instructions as trees of diverse strategies, critiques, and pairwise comparisons — every branch scorable as correct/incorrect (2024-04, arXiv:2404.02078).
• In subjective domains (writing assistance), preference labels entangle desirable and undesirable signals (polish + persona distortion); a preference tree built on such labels would encode the distortion as a 'winning' branch and contaminate alignment (2026-04, arXiv:2604.22503).
• Annotation responses decompose into three signal types (genuine preference, non-attitudes, constructed-on-the-spot); treating them uniformly poisons reward signals (2026-01, arXiv:2604.03238).
• Crowdsourced pairwise voting at scale produces credible rankings on open-ended prompts when questions are discriminating and crowd aligns with experts (2024-03, arXiv:2403.04132).
• SAMI aligns models to written *principles* (not preference pairs) by maximizing mutual information between constitution and response — no demonstrations required (2024-04, arXiv:2404.14313).

Anchor papers (verify; mind their dates):
• arXiv:2404.02078 (2024-04): Preference Trees for LLM reasoning
• arXiv:2604.22503 (2026-04): Persona distortions in writing assistance
• arXiv:2403.04132 (2024-03): Crowdsourced pairwise voting at scale
• arXiv:2404.14313 (2024-04): Self-Supervised Alignment (SAMI)

Your task:

(1) RE-TEST EACH CONSTRAINT. For tree transfer beyond math/code: Has newer work (last 6 months) shown ways to extract *verifiable* pairwise labels in subjective domains — or to replace preference labels entirely with principle-grounded signals? Specifically test whether the library's claim ('labels don't transfer, structure does') still holds, or whether recent methods (e.g., constitutional AI variants, multi-annotator voting, implicit reward models, or principle extraction) have relaxed the label bottleneck. Separate the durable question (how to structure alignment data when ground truth vanishes?) from the perishable limitation (preference trees require verifiable signals).

(2) Surface the strongest CONTRADICTING or SUPERSEDING work from ~6 months onward — any paper showing preference trees *do* work on subjective tasks, or showing principle-based alignment *fails* in practice, or revealing new annotation-signal decompositions.

(3) Propose 2 research questions that ASSUME the regime may have moved: (a) If recent work has made subjective preference signals cleaner, can you now *mix* verifiable (math/code branches) and subjective (principle-grounded branches) in a single tree? (b) If principle-based alignment has become mainstream, what *still* requires a preference tree over a constitution?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
