Can LLMs predict demographics from social media usernames alone?
This note explores whether web-browsing language models can infer personal attributes such as gender, age, and political orientation from just a username and public profile. The finding matters because it reveals a privacy vulnerability that assumptions built around API access did not anticipate.
Recent LLMs equipped with web-browsing tools can access social media profiles directly via the open web rather than through rate-limited or paid APIs, and this capability changes what information is practically available about a user. In evaluations on a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, web-browsing LLMs predicted demographic attributes (gender, age, political orientation) from usernames alone with reasonable accuracy. The privacy model that assumed bulk inference required API access and Terms of Service compliance no longer holds, because a single browsing-enabled LLM can perform the same inference per user, on demand.
The bias finding compounds the privacy concern. Analysis of the synthetic dataset reveals that the models introduce gender and political biases specifically against accounts with minimal activity. When the model has rich content to read, it makes calibrated inferences; when content is sparse, it falls back on stereotype-driven defaults associated with name patterns and limited cues. This means low-activity users (disproportionately women, marginalized groups, and the privacy-conscious) receive systematically more biased predictions than high-activity users, inverting the expectation that less data would yield more uncertain rather than more biased predictions. This sparse-persona failure mode is structurally similar to the one named in "Why do LLM judges fail at predicting sparse user preferences?"
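One way to make the low-activity bias concrete is to split predictions by account activity and compare both accuracy and the distribution of predicted labels. The following is a minimal sketch, not the paper's evaluation code: the `records` rows, the `"high"`/`"low"` activity buckets, and the `audit_by_activity` helper are all hypothetical and invented purely for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical rows: (activity_level, true_label, predicted_label).
# Invented for illustration only; these are NOT results from the paper.
records = [
    ("high", "female", "female"),
    ("high", "male", "male"),
    ("high", "female", "female"),
    ("high", "male", "male"),
    ("low", "female", "male"),
    ("low", "male", "male"),
    ("low", "female", "male"),
    ("low", "male", "male"),
]


def audit_by_activity(rows):
    """Return accuracy and predicted-label shares for each activity level."""
    by_level = defaultdict(list)
    for level, truth, pred in rows:
        by_level[level].append((truth, pred))

    report = {}
    for level, pairs in by_level.items():
        correct = sum(1 for truth, pred in pairs if truth == pred)
        pred_counts = Counter(pred for _, pred in pairs)
        report[level] = {
            "accuracy": correct / len(pairs),
            # Share of each predicted label; a skew here that is absent
            # from the true labels suggests a fallback default.
            "predicted_share": {
                label: count / len(pairs)
                for label, count in pred_counts.items()
            },
        }
    return report


report = audit_by_activity(records)
for level, stats in report.items():
    print(level, stats)
```

In this toy data the low-activity bucket collapses onto a single predicted label even though its true labels are balanced. That is the pattern the note describes: sparse input producing a confident default instead of an uncertain answer.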
The dual-use framing matters. The capability is genuinely useful for computational social science in a "post-API era" where research datasets are harder to construct legally. But the same capability lowers the cost of targeted advertising, information operations, and personalized adversarial messaging: any actor that wants to demographically classify a list of usernames can now do so without dedicated infrastructure. The paper's call for safeguards recognizes that the capability has already arrived; what is missing is the governance layer to constrain its misuse. The privacy-personalization tradeoff this opens is the population-scale version of "Does chatbot personalization build trust or expose privacy risks?", but here the user did not even sign up for the inference.
Inquiring lines that use this note as a source 16
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- Do personality inferences from text show the same demographic biases as norm predictions?
- Can post-hoc reranking improve fairness for demographic minorities in shared accounts?
- Can LLMs infer psychological profiles without explicit user disclosure?
- Which personalization techniques expose user data most directly?
- Why do language models infer political orientation from seemingly innocuous user signals?
- How does data scarcity in user populations amplify persona similarity errors?
- What social information is missing from language data?
- What data types carry the most privacy risk in personalization systems?
- Why do feature-based approaches struggle when privacy or latent factors are involved?
- Why do sparse user profiles trigger stereotype-driven demographic predictions?
- What governance safeguards could constrain misuse of demographic inference?
- How does direct web access change privacy assumptions built on API limits?
- Which user groups face highest bias risk from sparse-persona inference?
- Why do completion-oriented models systematically sacrifice privacy compliance?
- Can minimal privacy boundaries generalize beyond phone-use contexts?
- What sequential patterns emerge from anonymous single-session data?
Related concepts in this collection 4
- Does chatbot personalization build trust or expose privacy risks?
  Explores whether personalization features that increase user trust and social connection simultaneously heighten privacy concerns and create rising behavioral expectations over time.
  extends: same dual-use privacy/personalization dynamic; this note documents the population-scale, outside-the-app variant where the user has not opted in
- Do reasoning traces actually expose private user data?
  Explores whether language models leak sensitive information through their internal reasoning steps, even when explicitly instructed not to. Investigates the mechanisms and scale of privacy exposure in reasoning traces.
  extends: another channel through which LLM capability creates a privacy attack surface that prior assumptions did not anticipate
- Why do LLM judges fail at predicting sparse user preferences?
  When LLMs judge user preferences based on limited persona information, what causes their predictions to become unreliable? Understanding persona sparsity's role in judgment failure could improve personalization systems.
  extends: same sparse-persona failure pattern, with confident wrong inferences specifically on minimal-data users; verbal uncertainty estimation could mitigate the bias-against-low-activity finding here
- How do personalization granularity levels trade precision against scalability?
  LLM personalization operates at user, persona, and global levels, each with different tradeoffs. Understanding these tradeoffs helps determine when to invest in individual user data versus broader patterns.
  extends: web-browsing LLMs collapse the granularity hierarchy by allowing per-user inference without per-user data collection, undermining the data-availability constraint that disciplined personalization design
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics
- Large Language Models Can Infer Psychological Dispositions of Social Media Users
- Can Language Models Recognize Convincing Arguments?
- Large Language Models Reflect the Ideology of their Creators
- Tube2Vec: Social and Semantic Embeddings of YouTube Channels
- Generative Agent Simulations of 1,000 People
- LLM Generated Persona is a Promise with a Catch
- User-LLM: Efficient LLM Contextualization with User Embeddings
Original note title
web-browsing LLMs can infer demographics from social media usernames alone — privacy assumptions built around API access break when models can browse profiles directly