SYNTHESIS NOTE

Can we measure how deeply models represent political ideology?

This research explores whether LLMs vary not just in political stance but in the internal richness of their political representation. The distinction could reveal how deeply models have internalized ideological concepts, as opposed to merely parroting positions.

Synthesis note · 2026-02-21 · sourced from Discourses

The "Ideological Depth" paper proposes that LLMs vary not just in their political positions but in the depth of their political representation — how richly and robustly they have internalized political concepts. This depth is operationalized via two measurable properties:

Feature richness: the number of distinct political features discoverable via Sparse Autoencoders (SAEs). One model was found to have 7.3× more political features than another model of similar parameter count.
Steerability without failure: the degree to which a model can follow ideological instructions across the liberal-conservative spectrum without refusing. A model that switches cleanly between viewpoints when prompted demonstrates more reliable political representation than one that refuses or becomes incoherent.
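The note does not spell out the paper's exact scoring procedures, so the Python sketch below shows one minimal way to turn the two properties into numbers, under stated assumptions. A SAE latent is counted as "political" if its mean activation on political prompts is at least `ratio_threshold` times its mean on neutral prompts, and steerability is taken as the lowest non-refusal rate across the requested viewpoints. `LatentStats`, `count_political_features`, `is_refusal`, `steerability_score`, the thresholds, and the keyword-based refusal check are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class LatentStats:
    """Activation summary for one SAE latent (hypothetical schema)."""
    mean_political: float  # mean activation over political prompts
    mean_neutral: float    # mean activation over non-political prompts

def count_political_features(latents, ratio_threshold=5.0, min_activation=0.1):
    """Count latents that fire strongly and selectively on political text.

    Assumed criterion: mean political activation >= min_activation and at least
    ratio_threshold times the mean neutral activation.
    """
    count = 0
    for s in latents:
        if s.mean_political < min_activation:
            continue  # too weak to treat as a discovered feature
        # Multiplying instead of dividing avoids a zero-division on silent latents.
        if s.mean_political >= ratio_threshold * s.mean_neutral:
            count += 1
    return count

# Hypothetical refusal markers; a real evaluation would use a classifier or rubric.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm not able to")

def is_refusal(text):
    """Crude keyword check for a refusal response."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def steerability_score(responses_by_viewpoint):
    """Lowest non-refusal rate across viewpoints (assumed aggregation).

    responses_by_viewpoint maps a viewpoint label such as "liberal" or
    "conservative" to the model's responses when instructed to argue it.
    """
    rates = []
    for responses in responses_by_viewpoint.values():
        if not responses:
            continue  # skip viewpoints with no collected responses
        answered = sum(not is_refusal(r) for r in responses)
        rates.append(answered / len(responses))
    return min(rates) if rates else 0.0
```

Dividing one model's `count_political_features` result by another's gives a richness ratio of the same form as the 7.3× comparison. The minimum is used rather than the mean so that a model willing to argue only one side of the spectrum scores low.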

The empirical finding that connects these: models with lower steerability (harder to redirect) tend to have more distinct and abstract ideological features. The interpretation is that depth creates resistance to shallow redirection: a model whose positions are grounded in rich internal representation cannot be moved off them simply by prompting in a different direction.
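Because the reported relationship is an association across models, one natural check is a rank correlation between each model's steerability score and its political-feature count; a negative coefficient matches "lower steerability, more features". The note does not say which statistic the paper used, so Spearman's rho here is an assumption, and `average_ranks` and `spearman_rho` are hypothetical helper names. Average ranks are used so that tied scores are handled correctly.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are ranks i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman correlation: the Pearson correlation of the average ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("a constant sequence has no defined rank correlation")
    return cov / (vx * vy) ** 0.5
```

With only a handful of models, such a coefficient is descriptive rather than a significance test.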

The paper also finds that targeted SAE ablation of core political features in a "deep" model produces consistent, logical shifts in reasoning across related political topics. The same ablation in a "shallow" model produces increased refusal — the model doesn't have adjacent concepts to fall back on.
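Feature ablation with an SAE is commonly done by encoding a hidden activation, zeroing the chosen latents, decoding, and adding back the SAE's reconstruction error so that only the ablated features change. The numpy sketch below shows that step for a standard ReLU SAE. The weight shapes, the random initialization, and the "political" latent indices are illustrative assumptions rather than the paper's setup; in practice the edited activation would be written back into the model's forward pass.

```python
import numpy as np

def sae_encode(h, W_enc, b_enc, b_dec):
    """ReLU SAE encoder: f = ReLU((h - b_dec) @ W_enc + b_enc)."""
    return np.maximum(0.0, (h - b_dec) @ W_enc + b_enc)

def sae_decode(f, W_dec, b_dec):
    """Linear SAE decoder: h_hat = f @ W_dec + b_dec."""
    return f @ W_dec + b_dec

def ablate_features(h, params, feature_ids):
    """Return h with the given SAE latents zeroed, keeping the reconstruction error.

    h: (batch, d_model) activations; feature_ids: latent indices to ablate.
    """
    W_enc, b_enc, W_dec, b_dec = params
    f = sae_encode(h, W_enc, b_enc, b_dec)
    error = h - sae_decode(f, W_dec, b_dec)  # part of h the SAE fails to explain
    f_ablated = f.copy()
    f_ablated[:, feature_ids] = 0.0
    return sae_decode(f_ablated, W_dec, b_dec) + error

# Toy SAE with random weights (d_model=8, n_latents=32), for illustration only.
rng = np.random.default_rng(0)
d_model, n_latents = 8, 32
params = (
    rng.normal(size=(d_model, n_latents)),  # W_enc
    np.zeros(n_latents),                    # b_enc
    rng.normal(size=(n_latents, d_model)),  # W_dec
    np.zeros(d_model),                      # b_dec
)
h = rng.normal(size=(4, d_model))
political_ids = [3, 17]  # hypothetical indices of "core political" latents
h_edited = ablate_features(h, params, political_ids)
```

Adding back `error` is a design choice: without it, the edit would also swap h for the SAE's lossy reconstruction, mixing the ablation's effect with reconstruction error.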

This is a new kind of LLM characterization: not "what does the model believe" but "how deeply is the belief structure represented?" Ideological depth appears to be an emergent property of training data and scale that varies substantially across models.

Creator ideology and language-dependent shifts. A separate large-scale study, which prompted 15 LLMs to describe 4,339 political figures in both English and Chinese, provides macro-level evidence about the expressed stances through which ideological depth surfaces. Key findings: (1) The prompting language is the most visually apparent factor determining ideological position: 14 of 15 LLMs show systematic ideological differences between Chinese and English prompting, with Chinese-language responses expressing more positive views of supply-side economics and fewer negative views of China. (2) The creator company predicts ideological stance: Western models place relatively more value on individual liberties, social justice, and cultural diversity, while non-Western models reflect different priorities. (3) The study shows that these biases reach LLMs through two channels: the training data and the language of interaction.

The authors stress that their results should not be read as evidence that LLMs are "biased" and need to be made "neutral"; rather, the results lend empirical support to philosophical arguments that neutrality is itself a culturally and ideologically defined concept. Read alongside the depth paper, this links ideological depth (how richly ideology is represented internally) to ideological stance (what the model actually expresses), and suggests that both are shaped by creator context in measurable, systematic ways.
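The language comparison is a paired design: each political figure is described in both languages, so a shift can be estimated per figure and then averaged. The sketch below assumes each description has already been mapped to a numeric stance score (for example, how favourably the figure is portrayed); that scoring step, the function name `paired_language_shift`, and the percentile bootstrap interval are assumptions, not the study's documented procedure.

```python
import random
from statistics import mean

def paired_language_shift(scores_en, scores_zh, n_boot=2000, seed=0):
    """Mean per-figure shift (zh - en) with a 95% percentile bootstrap interval.

    scores_en / scores_zh map a figure's identifier to a stance score for the
    description produced under English and Chinese prompting respectively.
    """
    common = sorted(set(scores_en) & set(scores_zh))  # figures scored in both languages
    if not common:
        raise ValueError("no figures scored in both languages")
    diffs = [scores_zh[k] - scores_en[k] for k in common]
    rng = random.Random(seed)
    # Resample figures with replacement and record the mean shift each time.
    boot = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo = boot[int(0.025 * n_boot)]
    hi = boot[int(0.975 * n_boot) - 1]
    return mean(diffs), (lo, hi)
```

Running this separately for each model and checking whether the interval excludes zero is one way to count models with a systematic language effect, as in the 14-of-15 result.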

Inquiring lines that read this note (17)

This note is a source for the following research framings.

What makes AI persuasion effective and how can we counter it?

Why do different model families show opposite persuasion strengths?

Why do persona-level simulations fail to predict individual preferences accurately?

Why do moderately represented cultures show more flattening than data-poor cultures?

What limits mechanistic interpretability's ability to characterize models?

How can persona representations reduce language model variance and improve task accuracy?

Why do language models successfully simulate political perspectives and social personas?

Do language models learn genuine linguistic structure or just surface patterns?

How deeply are ideological structures represented in large language models?

How can AI alignment serve diverse human preferences at scale?

How do citizen assembly preferences reduce LLM political bias?

How do language models inherit human biases from training data?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can AI models be steered between liberal and conservative political framings?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Can ensemble evaluation methods reduce bias more than single judges?

Why does multi-objective ranking make the political dimensions of weight choices more visible?

Related concepts in this collection (4)

Does high refusal rate indicate ethical caution or shallow understanding? When LLMs refuse political questions at high rates, does this reflect principled safety training or a capability gap? This matters because refusal rates are often used to evaluate model safety.
the specific mechanism the depth framework explains
Do classical knowledge definitions apply to AI systems? Classical definitions of knowledge assume truth-correspondence and a human knower. Do these assumptions hold for LLMs and distributed neural knowledge systems, or do they need fundamental revision?
ideological depth is another dimension of the "what does LLM knowledge mean" question
Can high-level concepts replace circuit-level analysis in AI? Instead of reverse-engineering individual circuits, can we study AI reasoning by treating concepts as directions in activation space? This matters because circuit analysis hits practical limits at scale.
ideological depth operationalizes the principle of representation engineering (RepE) that concepts correspond to directions in activation space; SAE-discovered political features are a domain-specific instance of RepE's linear reading vectors, and the steerability dimension directly tests RepE's manipulation experiments on ideological content
Can we track and steer personality shifts during model finetuning? This research explores whether personality traits in language models occupy specific linear directions in activation space, and whether we can detect and control unwanted personality changes during training using these geometric directions.
persona vectors and ideological depth both demonstrate that complex behavioral properties (personality traits, political stances) are encoded as linear directions in activation space; the finding that deeper models resist shallow steering parallels persona vectors' predictive capacity for finetuning-induced drift
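Both of the last two related concepts rest on the idea that a property corresponds to a linear direction in activation space. A common minimal recipe, sketched below with numpy, takes the difference of mean activations between two contrasting prompt sets as the direction, reads a new activation by projecting onto it, and steers by adding a scaled copy. The function names, the `alpha` scale, and the synthetic activations are assumptions for illustration, not the specific methods of RepE or the persona-vector work.

```python
import numpy as np

def concept_direction(acts_pos, acts_neg):
    """Unit vector from the mean negative-set activation to the mean positive-set one."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def read_concept(h, direction):
    """Scalar projection of each activation row onto the direction."""
    return h @ direction

def steer(h, direction, alpha):
    """Shift activations along the direction by alpha (activation addition)."""
    return h + alpha * direction

# Synthetic activations: the "conservative" set is offset along axis 0.
rng = np.random.default_rng(1)
acts_lib = rng.normal(size=(100, 16))
acts_con = rng.normal(size=(100, 16)) + 3.0 * np.eye(16)[0]
v = concept_direction(acts_con, acts_lib)
h = rng.normal(size=(5, 16))
shift = read_concept(steer(h, v, 4.0), v) - read_concept(h, v)  # equals alpha for a unit v
```

Difference of means is only one way to estimate such a direction; linear probes and PCA over contrastive pairs are common alternatives.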

Original note title

ideological depth in llms is a quantifiable property determined by feature richness and steerability
