How does textual-only feedback limit what a persona can learn about users?
This explores a specific bottleneck: when a persona learns only from text-based signals (written feedback, dialogue transcripts), what kinds of user information slip through the cracks?
Relying on text alone (written feedback, conversation logs, thumbs-up labels) bounds what a persona can actually pick up about a user. The corpus suggests the limit isn't mainly the volume of text but the *type* of information text carries. The sharpest framing comes from work showing that natural feedback splits into two orthogonal channels: an evaluative one (how good was that response?) and a directive one (how should it change?). Scalar or coarse textual signals reliably capture the first and quietly discard the second, so a persona trained on them learns *that* it was wrong without learning *which way* to move (see: Can scalar rewards capture all the information in agent feedback?). That missing directional content is exactly the part a persona needs to adapt rather than just score itself.
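The evaluative/directive split can be made concrete with a toy sketch. Everything here is an illustrative assumption rather than anything taken from the cited note: the `Feedback` class, the `give_feedback` helper, and the two-trait (verbosity, formality) representation are hypothetical. The point is only that two responses which miss in opposite directions receive the same scalar score.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    score: float            # evaluative channel: how good was the response?
    direction: List[float]  # directive channel: how far each trait should move


def give_feedback(response: List[float], target: List[float]) -> Feedback:
    """Hypothetical user who compares a response's traits with their preferences."""
    gaps = [t - r for r, t in zip(response, target)]
    score = -sum(abs(g) for g in gaps)  # closer to the target means a higher score
    return Feedback(score=score, direction=gaps)


# Made-up trait values: (verbosity, formality), each in [0, 1].
target = [0.5, 0.5]
too_terse = [0.25, 0.5]
too_wordy = [0.75, 0.5]

terse_fb = give_feedback(too_terse, target)
wordy_fb = give_feedback(too_wordy, target)

# The scalar channel cannot tell the two mistakes apart...
print(terse_fb.score, wordy_fb.score)
# ...while the directive channel points in opposite directions.
print(terse_fb.direction[0], wordy_fb.direction[0])
```

A learner that sees only `score` has to explore to find out which way to adjust verbosity; one that also sees `direction` can update immediately.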
A second limit is sparsity. When the textual trace of a user is thin, it simply lacks the predictive power to anchor specific preferences: LLM judges built on sparse personas become unreliable, and the honest fix is to let the model abstain on low-confidence cases rather than force a guess (see: Why do LLM judges fail at predicting sparse user preferences?). The implication is that textual feedback degrades gracefully only if the system knows when it doesn't know. Active approaches push against this by *choosing* which text to elicit: asking a handful of maximally informative questions to pin down a user's reward coefficients, so that ten well-targeted queries can outperform a large pile of incidental text (see: Can user preferences be learned from just ten questions?).
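Abstention under low confidence can be sketched as a simple selective-prediction rule. This is a minimal illustration, not the cited work's method: the function names, the 0.7 threshold, and the four toy cases are all assumptions, and the confidence values stand in for whatever verbal uncertainty estimate the judge reports.

```python
from typing import List, Optional, Tuple

# One case: (judge's prediction, judge's stated confidence, true preference).
Case = Tuple[str, float, str]


def judge_with_abstention(prediction: str, confidence: float,
                          threshold: float = 0.7) -> Optional[str]:
    """Return the prediction only when confidence clears the threshold."""
    return prediction if confidence >= threshold else None


def selective_accuracy(cases: List[Case], threshold: float) -> Tuple[float, float]:
    """Return (coverage, accuracy on the answered cases)."""
    answered = [(p, truth) for p, conf, truth in cases
                if judge_with_abstention(p, conf, threshold) is not None]
    coverage = len(answered) / len(cases)
    if not answered:
        return coverage, float("nan")
    accuracy = sum(p == truth for p, truth in answered) / len(answered)
    return coverage, accuracy


# Made-up data: the judge is right when confident and mixed when it is not.
cases = [("A", 0.9, "A"), ("B", 0.85, "B"), ("A", 0.4, "B"), ("B", 0.3, "B")]

forced = selective_accuracy(cases, threshold=0.0)     # always answer
selective = selective_accuracy(cases, threshold=0.7)  # abstain when unsure
print(forced, selective)
```

The trade-off is explicit: abstaining lowers coverage in exchange for higher accuracy on the cases the judge does answer.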
Text also flattens users in a way that costs accuracy. Several recommendation papers argue a single user is really *multiple* personas, and which one is active depends on the moment and the item in front of them; collapsing that into one textual profile loses the candidate-conditional structure that both improves predictions and explains them (see: Can modeling multiple user personas improve recommendation accuracy?; Can attention mechanisms reveal which user taste explains each recommendation?). A persona learning from undifferentiated text tends toward an averaged-out user rather than the situational one.
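The candidate-conditional idea can be sketched as attention over a user's persona vectors, with the candidate item acting as the query. This is a generic dot-product attention illustration under assumed names (`score_item`, two-dimensional embeddings, made-up values); it is not the exact formulation of AMP-CF.

```python
import math
from typing import List, Tuple


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def score_item(personas: List[List[float]],
               item: List[float]) -> Tuple[float, List[float]]:
    """Score one candidate item; also return the per-persona attention weights."""
    weights = softmax([dot(p, item) for p in personas])
    # The user representation is rebuilt for each candidate item.
    user_vec = [sum(w * p[d] for w, p in zip(weights, personas))
                for d in range(len(item))]
    return dot(user_vec, item), weights


# Hypothetical user with two tastes, in a made-up 2-d embedding space.
personas = [[1.0, 0.0],   # e.g. a "documentaries" persona
            [0.0, 1.0]]   # e.g. an "action films" persona

doc_score, doc_weights = score_item(personas, [2.0, 0.0])
act_score, act_weights = score_item(personas, [0.0, 2.0])
print(doc_weights, act_weights)
```

The attention weights double as the explanation: they show which persona dominated the score for each candidate, which a single averaged profile cannot do.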
There's a deeper limit the corpus surfaces almost as a warning: text-only learning can fake competence. When one model voices all sides of an interaction, social simulations look fluent, but they collapse the moment agents hold private information the text never states, revealing that the model was skipping the grounding work real understanding requires (see: Why do LLMs fail when simulating agents with private information?). Persona drift is the same gap seen over time: textual supervision rewards correct lines but never punishes contradictions, so consistency erodes unless you add an explicit penalty signal or invert the usual RL setup and train the user simulator itself for consistency (see: Why does supervised learning fail to enforce persona consistency?; Can training user simulators reduce persona drift in dialogue?).
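The gap between rewarding correct lines and penalizing contradictions can be shown with a toy reward comparison. This is a sketch under stated assumptions: the function names, the penalty weight of 2.0, and the labelled turns are hypothetical, and in practice the contradiction flags would come from something like human annotation rather than being hard-coded.

```python
from typing import List, Tuple

# One turn: (matches a reference line, contradicts the persona). Labels are made up.
Turn = Tuple[bool, bool]


def imitation_signal(turns: List[Turn]) -> float:
    """Supervised-style signal: credit for matching lines, nothing else."""
    return float(sum(1 for matches, _ in turns if matches))


def penalized_reward(turns: List[Turn], penalty: float = 2.0) -> float:
    """Same credit, minus an explicit cost for each flagged contradiction."""
    contradictions = sum(1 for _, contradicts in turns if contradicts)
    return imitation_signal(turns) - penalty * contradictions


consistent = [(True, False), (True, False), (False, False)]
drifting = [(True, False), (True, False), (False, True)]

# The imitation-only signal scores both dialogues the same...
print(imitation_signal(consistent), imitation_signal(drifting))
# ...and only the penalized reward separates them.
print(penalized_reward(consistent), penalized_reward(drifting))
```

Under the imitation-only signal, an off-script turn and a persona-contradicting turn cost exactly the same, so nothing pushes the model away from contradictions specifically.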
The through-line worth carrying away: text isn't a neutral pipe. It under-carries direction, thins out under sparsity, averages away a user's multiplicity, and lets models simulate grounding they never did. The most promising counters in the corpus don't add *more* text; they add a different channel: directive signals recovered at the token level, abstention under uncertainty, actively chosen questions, contradiction penalties, and personas that evolve through simulated interaction rather than passive reading (see: Can personas evolve in real time to match what users actually want?).
Sources (9 notes)
Natural feedback carries two orthogonal types of information: evaluative (how well an action performed) and directive (how it should change). Scalar rewards capture evaluation but discard directional specifics that token-level distillation can recover, making the two complementary rather than redundant.
Sparse persona information lacks predictive power for specific preferences, causing LLM judges to fail. Verbal uncertainty estimation recovers reliability above 80% on high-certainty samples by allowing abstention rather than forced judgment.
PReF learns base reward functions from preference data, then uses active learning to select maximally informative questions that reduce coefficient uncertainty. Responses can then be personalized to each user via inference-time reward alignment, without modifying model weights.
AMP-CF separates user representation into latent personas weighted by attention to the candidate item. This candidate-conditional approach improves accuracy by adapting the user representation at prediction time and produces inherent explanations for why items were recommended.
AMP-CF represents each user as multiple latent personas weighted dynamically by candidate item. This makes recommendations both diverse and interpretable—each suggestion traces to the specific persona preference it satisfies—without requiring post-hoc reranking.
Research shows LLMs perform well when one model controls all interlocutors but fail systematically when agents possess private information. This reveals that apparent social competence relies on grounding work that models skip in omniscient settings.
Supervised learning cannot enforce persona consistency because it rewards correct responses but never penalizes contradictions. Offline reinforcement learning combines inexpensive training on existing data with explicit contradiction rewards using human-annotated labels, offering a practical alternative to expensive online RL.
By inverting standard RL setups to train user simulators for consistency using three complementary metrics (prompt-to-line, line-to-line, Q&A consistency) as reward signals, persona drift decreases by over 55%. This approach captures distinct failure types: local drift within turns, global drift across conversations, and factual contradictions.
PersonaAgent uses structured personas to bridge episodic/semantic memory and personalized actions, optimizing them at test time by simulating recent interactions against textual feedback. Learned personas cluster meaningfully in latent space, suggesting genuine user-specific separation beyond standard post-training drift.