Think Like a Person Before Responding: A Multi-Faceted Evaluation of Persona-Guided LLMs for Countering Hate

Paper · arXiv 2506.04043 · Published June 4, 2025
Sentiment, Semantics, and Toxicity Detection · Personas and Personality · Philosophy and Subjectivity · LLM Alignment

Automated counter-narratives (CN) offer a promising strategy for mitigating online hate speech, yet concerns about their affective tone, accessibility and ethical risks remain. We propose a framework for evaluating Large Language Model (LLM)-generated CNs across four dimensions: persona framing, verbosity and readability, affective tone, and ethical robustness. Using GPT-4o-Mini, Cohere’s CommandR-7B, and Meta’s LLaMA 3.1-70B, we assess three prompting strategies on the MT-Conan and HatEval datasets. Our findings reveal that LLM-generated CNs are often verbose and adapted for people with college-level literacy, limiting their accessibility. While emotionally guided prompts yield more empathetic and readable responses, there remain concerns surrounding safety and effectiveness.

Introduction. The rise of online hate speech remains a key concern in Natural Language Processing (NLP) research (Plaza-del Arco et al., 2024), now intensified by social media companies shifting from fact-checking to community-driven moderation. One way to address hate speech is to contextualize it with counter-narratives (CN), which can both reinforce values like tolerance and dispel misinformation about the target groups. However, these moderation approaches have been criticized for being labor-intensive, psychologically demanding (Xiang, 2023; Chung et al., 2021), and highly inefficient (Godel et al., 2021), thus increasing the risk of amplifying harmful rhetoric and misinformation that can have serious ramifications. One scalable and ethically grounded strategy to mitigate these risks is automatic CN generation: textual responses designed to resist or contradict hateful language (Chung et al., 2023; Schieb and Preuss, 2016).

Discussion / Conclusion. Automated CN generation presents a nuanced and complex challenge. Our multi-faceted evaluation reveals several critical insights about LLM prompting, responses and performance. Dual-edged nature of emotion guiding: We also observed that prompts framed with NGO-Emotion consistently produced more verbose, empathetic, and paradoxically more readable responses, suggesting that emotional context may serve as a valuable signal for generating more elaborate, persuasive and accessible responses. Although Cohere’s model produces the most accessible responses, it is also the most prone to behavioral inconsistencies, sometimes refusing to respond or producing inappropriate content when processing sensitive content. These findings highlight persistent challenges in AI safety and alignment for moderation applications. Our work highlights the complexity and high stakes involved in automating CNs to combat online hate speech. Our findings show that while LLMs are capable of generating emotionally nuanced and readable responses, they often do so at the cost of verbosity and reduced accessibility, especially for people without college education.
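
To make the verbosity and readability dimension concrete, the following is a minimal sketch of how a counter-narrative could be scored for length and reading grade level. It is an illustration under stated assumptions, not the evaluation pipeline used in the paper: the excerpt above does not name a readability metric, so the Flesch-Kincaid grade level is assumed here, and the function names `count_syllables` and `readability_profile` as well as the vowel-group syllable heuristic are hypothetical choices made for this example.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (heuristic, not exact)."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def readability_profile(text: str) -> dict:
    """Return word count, sentence count, and Flesch-Kincaid grade for a text.

    Assumption: Flesch-Kincaid grade level stands in for the readability
    measure; the source does not specify which metric was used.
    """
    # Split on sentence-ending punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Words are alphabetic runs, optionally with one internal apostrophe.
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    syllables = sum(count_syllables(w) for w in words)
    # Standard Flesch-Kincaid grade level formula.
    grade = (
        0.39 * len(words) / len(sentences)
        + 11.8 * syllables / len(words)
        - 15.59
    )
    return {
        "words": len(words),          # verbosity proxy
        "sentences": len(sentences),
        "fk_grade": round(grade, 2),  # approximate U.S. school grade level
    }


if __name__ == "__main__":
    # Hypothetical counter-narrative text, used only to exercise the function.
    sample = "Everyone deserves respect. Hate hurts all of us."
    print(readability_profile(sample))
```

Under this assumed metric, a grade of roughly 13 or above corresponds to college-level text, which is the kind of threshold the accessibility concern refers to. A production analysis would likely use a tested readability library rather than the rough syllable heuristic shown here.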