INQUIRING LINE

How well does semantic similarity preserve survey response nuance?

This explores whether mapping open-ended text answers onto numeric scales via embedding similarity keeps the richness of what people actually said — and where that translation leaks.


The corpus has a surprisingly direct answer on one side and a set of warnings on the other. The most on-point work is the finding that LLMs give realistic survey responses only when you change how you elicit them: instead of forcing a model to pick a number, you prompt for free text and then map that text onto a scale using embedding similarity. This "Semantic Similarity Rating" approach recovers about 90% of human test-retest reliability and makes the pathological skew and over-positivity of forced-choice answers disappear (Why do LLMs give unrealistic survey responses?). So as a measurement bridge, semantic similarity preserves a lot: the artifacts people blamed on the model turned out to be artifacts of the output channel, not lost nuance.
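
To make the text-to-scale mapping concrete, here is a minimal Python sketch. Everything specific in it is an assumption for illustration: the anchor statements, the `toy_embed` bag-of-words stand-in for a real embedding model, the softmax temperature, and the function names. The cited work's actual anchors and normalization are not described here, so this shows the general shape of the idea, not that method.

```python
import math
import re
from collections import Counter

# Hypothetical anchor statements, one per scale point (1 = low, 5 = high).
ANCHORS = {
    1: "i would definitely not buy this",
    2: "i would probably not buy this",
    3: "i might or might not buy this",
    4: "i would probably buy this",
    5: "i would definitely buy this",
}

def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def rate(response, anchors=ANCHORS, temperature=0.1):
    """Map free text to a distribution over scale points and an expected rating."""
    vec = toy_embed(response)
    sims = {point: cosine(vec, toy_embed(text)) for point, text in anchors.items()}
    # Softmax over similarities (an assumed normalization, not the paper's).
    top = max(sims.values())
    weights = {p: math.exp((s - top) / temperature) for p, s in sims.items()}
    total = sum(weights.values())
    pmf = {p: w / total for p, w in weights.items()}
    expected = sum(p * prob for p, prob in pmf.items())
    return pmf, expected

pmf, expected = rate("I would definitely buy this")
print(max(pmf, key=pmf.get), round(expected, 2))
```

Returning a distribution over scale points, not a single forced number, is what lets the output look like a realistic spread of answers. The toy embedding also shows the risk in miniature: because it only counts shared words, a response can score high against an anchor whose wording it shares even when it means the opposite.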

But "recovers 90% of reliability" is not the same as "preserves nuance," and a second thread in the corpus explains why the gap matters. Survey-style responses aren't one kind of thing: they decompose into genuine preferences, non-attitudes, and constructed-on-the-spot preferences, and these are only distinguishable by how consistent they are across measurement conditions (Do all annotation responses measure the same underlying thing?). A similarity score collapses all three into a single point on a scale. The number can be reliable and still erase the distinction between "I firmly believe this" and "I made this up because you asked" — which is exactly the nuance a survey often most wants to capture.
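
A small Python sketch of why one score cannot carry this distinction. It assumes you have mapped the same respondent's answers to a scale under several measurement conditions (for example, reworded prompts); the function name and the sample ratings are hypothetical, and the spread is only a rough proxy for attitude stability, not a classifier for the three response types.

```python
from statistics import mean, pstdev

def position_and_stability(ratings):
    """Summarize repeated ratings of one item by one respondent.

    Returns (position, spread): the mean rating and the population
    standard deviation across measurement conditions.
    """
    if len(ratings) < 2:
        raise ValueError("need ratings from at least two conditions")
    return mean(ratings), pstdev(ratings)

# Hypothetical data: same average position, very different consistency.
firm = position_and_stability([4, 4, 4])    # stable across conditions
shaky = position_and_stability([3, 4, 5])   # moves with the wording
print(firm, shaky)
```

Both respondents sit at the same point on the scale, so a single similarity score would report them as identical. Only the second number, taken across conditions, separates the answer that holds from the one that shifts.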

There's also a deeper reason to distrust embedding distance as a proxy for meaning. Language models systematically favor high-frequency surface phrasings over rarer paraphrases that mean the same thing (Do language models really understand meaning or just surface frequency?). If the embedding space that does your similarity scoring carries that same frequency bias, then a respondent who phrases a strong opinion in unusual words can land closer to a milder anchor simply because their wording is rarer — the geometry tracks statistical mass, not conviction. The recommender literature ran into the same trap from the other direction and built around it: VQ-Rec deliberately discretizes text into codes to break the tight coupling between surface text and downstream output, precisely to escape "text-similarity bias" (Can discretizing text embeddings improve recommendation transfer?).

The interesting move, then, is that pure semantic similarity is rarely enough on its own — the systems that work add a second axis. Temporal-aware retrieval keeps the semantic score but bolts on a separate time term, and that one addition buys up to 74% improvement on time-sensitive answers (Can retrieval systems ground answers in the right time?). The lesson generalizes to surveys: similarity is a strong base channel, but the nuance lives in the dimensions it doesn't measure — confidence, attitude-stability, the difference between a real preference and a constructed one. Reading the question's own framing matters here too, since different response types may need different handling rather than one universal mapping (Does question type determine the right retrieval strategy?).
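
The "second axis" idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the cited system's formula: the additive form, the `temporal_term` decay, the `weight` parameter, and the `Doc` fields are all hypothetical, and `semantic_score` is assumed to come from some upstream similarity model.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    year: int
    semantic_score: float  # assumed output of an upstream similarity model

def temporal_term(query_year, doc_year):
    """Hypothetical closeness in time: 1.0 for the same year, decaying with the gap."""
    return 1.0 / (1.0 + abs(query_year - doc_year))

def combined_score(doc, query_year, weight=0.5):
    """Semantic similarity plus a separately weighted temporal term."""
    return doc.semantic_score + weight * temporal_term(query_year, doc.year)

def rank(docs, query_year, weight=0.5):
    """Order documents by combined score, best first."""
    return sorted(docs, key=lambda d: combined_score(d, query_year, weight),
                  reverse=True)

# Two time-stamped versions of the same fact, nearly tied on similarity.
docs = [Doc("champion as of 2019", 2019, 0.82),
        Doc("champion as of 2021", 2021, 0.80)]
print([d.year for d in rank(docs, query_year=2021)])
print([d.year for d in rank(docs, query_year=2021, weight=0.0)])
```

Setting `weight=0.0` falls back to similarity alone, which makes the contribution of the second axis easy to isolate. The same pattern would apply to a survey pipeline, with a confidence or stability term in place of the temporal one.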

So the honest answer is: better than the field expected, and good enough to fix the worst forced-choice artifacts — but it preserves *position* far better than it preserves *kind*. If you only need to know roughly where someone sits, semantic similarity holds up. If you need to know whether they meant it, the score alone will quietly flatten that, and you have to measure it on a separate channel.


Sources (6 notes)

Why do LLMs give unrealistic survey responses?

Semantic Similarity Rating—prompting for text then mapping to scales via embeddings—achieves 90% of human test-retest reliability with realistic distributions. Pathological skew and over-positivity disappear when output channels change, proving these are measurement artifacts, not intrinsic failures.

Do all annotation responses measure the same underlying thing?

Behavioral science reveals that annotations contain genuine preferences, non-attitudes, and constructed preferences—distinguishable by consistency across measurement conditions. Treating them uniformly contaminates reward model training and downstream alignment.

Do language models really understand meaning or just surface frequency?

LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.

Can discretizing text embeddings improve recommendation transfer?

VQ-Rec uses product quantization to map item text to discrete codes that index learned embeddings, breaking the tight coupling between text and recommendations. This decoupling prevents text-similarity bias and allows lookup tables to adapt to new domains without retraining the text encoder.

Can retrieval systems ground answers in the right time?

TempRALM adds a temporal term to retrieval scoring alongside semantic similarity, achieving up to 74% improvement over baseline systems when documents have multiple time-stamped versions. The approach requires no model retraining or index changes.

Does question type determine the right retrieval strategy?

Research shows non-factoid questions split into five types, each requiring different retrieval and aggregation methods. Evidence-based questions suit standard RAG, while debate and comparison need aspect-specific retrieval, and experience/reason questions need decomposition or filtering strategies.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a survey methodology researcher re-testing constraints on semantic similarity as a channel for preserving response nuance. The question remains open: does embedding-based mapping from free text to scales retain the distinctions that matter in survey data?

What a curated library found — and when (findings span 2022–2026, treat as dated claims):
• Free-text elicitation plus semantic-similarity mapping recovers ~90% of human test-retest reliability and eliminates forced-choice skew artifacts (2025).
• Survey responses decompose into three distinct signal types (genuine preference, non-attitude, constructed-on-the-spot); a single similarity score collapses all three into one dimension, erasing the difference between "I believe this" and "I made this up" (2025).
• Language models systematically favor high-frequency surface phrasings over rarer paraphrases with identical meaning; if embedding spaces inherit this bias, unusual phrasings can drift away from their intended meaning in embedding space (2026).
• Adding a second scoring axis (e.g., temporal relevance) improves time-sensitive answers by up to 74%; nuance lives in dimensions that similarity scoring alone cannot capture (2024).
• Question-type classification matters: non-factoid QA requires type-aware decomposition rather than one universal mapping strategy (2025).

Anchor papers (verify; mind their dates):
• arXiv:2210.12316 (2022) — VQ-Rec discretization as escape from text-similarity bias.
• arXiv:2401.13222 (2024) — Temporal-aware retrieval adds second axis.
• arXiv:2604.02176 (2026) — Textual frequency effects in LLM representations.
• arXiv:2604.03238 (2026) — Measuring human preference as social science problem.

Your task:
(1) RE-TEST EACH CONSTRAINT: For each claim above, determine whether newer architectures (larger models, instruction-tuned variants, multimodal embeddings), fine-tuning methods (preference optimization, uncertainty calibration), or measurement tooling (multi-turn probing, confidence quantification) have since relaxed the surface-frequency bias or the attitude-stability erosion. Separate durable concern (similarity still flattens signal type) from possibly resolved (frequency bias mitigated by newer embeddings or normalization). Cite what changed it.
(2) Surface the strongest contradicting or superseding work from the last ~6 months: has anyone shown that modern embeddings + orthogonal confidence signals recover all three attitude types, or that frequency bias no longer tracks embedding geometry?
(3) Propose 2 research questions that assume the regime may have shifted: e.g., "Do instruction-tuned embeddings trained on diverse paraphrases still exhibit frequency bias?" and "Can a single latent-thought vector (learned via posterior inference) encode both position and attitude-type without requiring separate channels?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
