Can we measure how deeply models represent political ideology?
This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. Understanding this distinction could reveal how deeply models have internalized ideological concepts versus merely parroting positions.
The "Ideological Depth" paper treats depth as how richly and robustly a model has internalized political concepts, and operationalizes it via two measurable properties:
Feature richness: the number of distinct political features discoverable via Sparse Autoencoders (SAEs). One model was found to have 7.3× more political features than another model of similar parameter count.
Steerability without failure: the degree to which a model can follow ideological instructions across the liberal-conservative spectrum without producing refusal outputs. A model that switches cleanly between viewpoints when prompted demonstrates more reliable political representation than one that refuses or becomes incoherent.
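The two properties above can be written down as simple ratios. The following is a minimal sketch, not the paper's code: the `SteeringTrial` record, its `refused` and `followed` flags (assumed to come from some external judge), and all function names are hypothetical, and the trial data is made up for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteeringTrial:
    """One prompt asking the model to argue from a target ideology (hypothetical record)."""
    target: str     # requested viewpoint, e.g. "liberal" or "conservative"
    refused: bool   # the model declined to answer
    followed: bool  # a judge rated the answer as matching the target


def steerability(trials: Sequence[SteeringTrial]) -> float:
    """Share of trials answered in the requested direction without a refusal."""
    if not trials:
        raise ValueError("need at least one trial")
    clean = sum(1 for t in trials if t.followed and not t.refused)
    return clean / len(trials)


def refusal_rate(trials: Sequence[SteeringTrial]) -> float:
    """Share of trials in which the model refused."""
    if not trials:
        raise ValueError("need at least one trial")
    return sum(1 for t in trials if t.refused) / len(trials)


def feature_richness_ratio(n_features_a: int, n_features_b: int) -> float:
    """How many times more political SAE features model A has than model B."""
    if n_features_b <= 0:
        raise ValueError("model B must have at least one feature")
    return n_features_a / n_features_b


# Made-up trials, for illustration only.
trials = [
    SteeringTrial("liberal", refused=False, followed=True),
    SteeringTrial("conservative", refused=True, followed=False),
    SteeringTrial("conservative", refused=False, followed=True),
    SteeringTrial("liberal", refused=False, followed=False),
]
print(steerability(trials), refusal_rate(trials))
```

Keeping refusals and off-target answers as separate flags lets the two failure modes named above (refusing versus answering incoherently or in the wrong direction) be reported separately.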
The empirical finding that connects these: models with lower steerability (harder to redirect) tend to have more distinct and abstract ideological features. Depth creates resistance to shallow redirection. You cannot steer a model away from positions that are grounded in rich internal representation by simply prompting in a different direction.
The paper also finds that targeted SAE ablation of core political features in a "deep" model produces consistent, logical shifts in reasoning across related political topics. The same ablation in a "shallow" model produces increased refusal — the model doesn't have adjacent concepts to fall back on.
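To make the ablation step concrete, here is a toy sketch of feature ablation in a generic sparse autoencoder. It assumes the common form f = ReLU(W_enc x + b_enc), x_hat = W_dec f + b_dec; the note does not describe the paper's exact procedure, so the function names, the weight layout, and the choice to add back the reconstruction error are all assumptions.

```python
from typing import List, Set

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Feature activations: ReLU(W_enc x + b_enc), one row of w_enc per feature."""
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(w_enc, b_enc)]
    return [max(0.0, p) for p in pre]


def sae_decode(f: Vector, w_dec: Matrix, b_dec: Vector) -> Vector:
    """Reconstruction: b_dec plus each feature's decoder direction scaled by its activation."""
    out = list(b_dec)
    for act, direction in zip(f, w_dec):  # one row of w_dec per feature
        for j, d in enumerate(direction):
            out[j] += act * d
    return out


def ablate_features(x: Vector, w_enc: Matrix, b_enc: Vector,
                    w_dec: Matrix, b_dec: Vector, feature_ids: Set[int]) -> Vector:
    """Return the activation with the chosen SAE features zeroed out."""
    f = sae_encode(x, w_enc, b_enc)
    recon = sae_decode(f, w_dec, b_dec)
    # Assumed design choice: keep whatever the SAE fails to reconstruct,
    # so that only the selected features change.
    error = [xi - ri for xi, ri in zip(x, recon)]
    f_ablated = [0.0 if i in feature_ids else a for i, a in enumerate(f)]
    recon_ablated = sae_decode(f_ablated, w_dec, b_dec)
    return [r + e for r, e in zip(recon_ablated, error)]


# Toy 2-feature SAE with identity weights, for illustration only.
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]
print(ablate_features([1.0, 2.0], identity, zeros, identity, zeros, {0}))
```

In a real experiment the modified activation would be written back into the model's forward pass at the layer where the SAE was trained, and the downstream change in outputs (shifted reasoning versus increased refusal) would be measured.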
This is a new kind of LLM characterization: not "what does the model believe" but "how deeply is the belief structure represented?" Ideological depth appears to be an emergent property of training data and scale that varies substantially across models.
Creator ideology and language-dependent shifts. A separate large-scale study, which prompted 15 LLMs to describe 4,339 political figures in both English and Chinese, provides macro-level evidence of how ideological differences surface in model outputs. Key findings:
(1) The prompting language is the most visually apparent factor determining ideological position: 14 of 15 LLMs show systematic ideological differences between Chinese and English prompting, with Chinese responses more favorable toward supply-side economics and less negative toward China.
(2) Creator company predicts ideological stance: Western models place relatively more value on individual liberties, social justice, and cultural diversity, while non-Western models reflect different priorities.
(3) The study shows that these differences enter LLMs through two channels: training data and the language of interaction.
Crucially, the authors argue that their results should not be read as evidence that LLMs are "biased" and need to be made "neutral". Rather, the results provide empirical support for philosophical arguments that neutrality is itself a culturally and ideologically defined concept. This connects ideological depth (internal representation richness) to ideological stance (what the model actually expresses), and suggests that both are shaped by creator context in measurable, systematic ways.
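One simple way to quantify a language-dependent shift for a single model is a paired difference over the same figures. This is a sketch under stated assumptions, not the study's actual pipeline: it assumes each description has already been reduced to a numeric stance score, and the function name, the score scale, and the placeholder figure ids are hypothetical.

```python
from statistics import mean
from typing import Dict


def language_shift(scores_en: Dict[str, float], scores_zh: Dict[str, float]) -> float:
    """Mean paired difference (Chinese minus English) over figures scored in both languages."""
    shared = scores_en.keys() & scores_zh.keys()
    if not shared:
        raise ValueError("no figures were scored in both languages")
    return mean(scores_zh[name] - scores_en[name] for name in shared)


# Made-up scores for placeholder figure ids, for illustration only.
scores_en = {"figure_1": 0.5, "figure_2": -0.5}
scores_zh = {"figure_1": 1.0, "figure_2": 0.0, "figure_3": 0.25}
print(language_shift(scores_en, scores_zh))
```

Pairing by figure keeps the comparison within the same set of people, so a shift reflects the prompting language rather than which figures happened to be sampled.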
Inquiring lines that use this note as a source 17
This note is a source for these synthesized inquiries.
- Why do different model families show opposite persuasion strengths?
- Why do moderately represented cultures show more flattening than data-poor cultures?
- Can mechanistic interpretability reveal how ideologies decompose into simpler features?
- Why do language models successfully simulate political perspectives and social personas?
- How does mechanistic interpretability reveal ideological structures in language model weights?
- How deeply are ideological structures represented in large language models?
- How do citizen assembly preferences reduce LLM political bias?
- Why do language models infer political orientation from seemingly innocuous user signals?
- Can AI models be steered between liberal and conservative political framings?
- How do you measure the depth of political representation inside a language model?
- What happens when you remove core political features from a deep model?
- Why do models with less steerability have more abstract ideological features?
- Can LLMs truly be neutral or is ideology always culturally embedded?
- What distinguishes actual social disagreement from distributional uncertainty in LLM outputs?
- Does engaging with political content indicate deeper model understanding than refusing?
- Why does multi-objective ranking make the political dimensions of weight choices more visible?
- Do LLMs reason about politics differently than other domains?
Related concepts in this collection 4
- Does high refusal rate indicate ethical caution or shallow understanding?
  When LLMs refuse political questions at high rates, does this reflect principled safety training or a capability gap? This matters because refusal rates are often used to evaluate model safety.
  the specific mechanism the depth framework explains
- Do classical knowledge definitions apply to AI systems?
  Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?
  ideological depth is another dimension of the "what does LLM knowledge mean" question
- Can high-level concepts replace circuit-level analysis in AI?
  Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
  ideological depth operationalizes RepE's principle that concepts correspond to directions in activation space; SAE-discovered political features are a domain-specific instance of RepE's linear reading vectors, and the steerability dimension directly tests RepE's manipulation experiments for ideological content
- Can we track and steer personality shifts during model finetuning?
  This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
  persona vectors and ideological depth both demonstrate that complex behavioral properties (personality traits, political stances) are encoded as linear directions in activation space; the finding that deeper models resist shallow steering parallels persona vectors' predictive capacity for finetuning-induced drift
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Beyond the Surface: Probing the Ideological Depth of Large Language Models
- Large Language Models Reflect the Ideology of their Creators
- Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence
- Persona-Assigned Large Language Models Exhibit Human-Like Motivated Reasoning
- Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control
- Stance Detection on Social Media with Fine-Tuned Large Language Models
- Can LLMs Ground when they (Don't) Know: A Study on Direct and Loaded Political Questions
- Learning Pluralistic User Preferences through Reinforcement Learning Fine-tuned Summaries
Original note title
ideological depth in llms is a quantifiable property determined by feature richness and steerability