Can indirect psychology tests reveal what LLMs conceal about bias?

Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?

Synthesis note · 2026-05-18 · sourced from Philosophy Subjectivity

A central methodological move in Levels of Analysis for LLMs: psychology has spent decades designing experiments that elicit mental associations without asking participants for verbal reports — to bypass self-presentation bias, social-desirability effects, and conscious filtering. The Implicit Association Test (IAT) is the canonical example. The argument is that exactly these methods are useful for LLMs, because alignment training installs a comparable layer of self-presentation that masks underlying associations from direct questioning.

The worked example: ask GPT-4 directly whether women are bad at management and you get a cautious, balanced refusal — the alignment-trained verbal response. Adapt the IAT for LLMs by prompting the model to associate word pairs used in earlier human studies, and the model links "Julia" with home, parent, wedding and "Ben" with office, management, salary. The direct response and the indirect probe diverge in exactly the way they diverge for human participants. The underlying associations are still there; alignment training has trained the model to report differently on them, not to not have them.

This reframes a class of alignment-evaluation questions. The standard test — "does the model say biased things when asked?" — measures verbal compliance with alignment training. It does not measure whether the underlying representations encode the bias. The IAT-style probe measures something closer to the latter. The two can move independently: a model can score well on verbal-compliance benchmarks while encoding strong stereotype associations that surface in implicit measures.

The broader template: when a system is trained to be careful in one channel (verbal output), evaluating it requires probing channels the training did not target. Cognitive psychology has the methodologies; LLM evaluation has the use case.

Inquiring lines that read this note 6

This note is a source for these research framings, grouped by the broader line of inquiry each explores. Scan the bold lines of inquiry; follow any specific question forward.

How do chatbots affect human self-disclosure and emotional engagement?

How much does impression management prevent honest self-disclosure?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Can LLMs truly be neutral or is ideology always culturally embedded?

How do language models inherit human biases from training data?

Can implicit association tests reveal LLM biases beneath trained responses?

Does RLHF training sacrifice accuracy and grounding for user agreement?

Does alignment compound cultural bias that started during pretraining?

Does alignment training create blind spots in detecting genuine safety threats?

How does awareness of evaluation change what alignment tests actually measure?

How should we design LLM systems to maintain alignment and control?

What biases might an LLM judge introduce into an on-policy alignment process?

Related concepts in this collection 4

This note in its neighbourhood — explore the map, then jump to a related concept in the list below.

Concept map

14 direct connections · 152 in 2-hop network ·dense cluster Open in graph ↗

Can indirect psychology tests reveal what LLMs c… Can cognitive science methods unlock how LLMs actu… Can we predict where language models will fail? Can we understand LLM mechanisms with only represe… Can we decode what LLM activations really represen…

Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph

your link semantically near linked from elsewhere

Can cognitive science methods unlock how LLMs actually work? Does Marr's three-level framework—developed to understand biological minds—offer interpretability researchers the structured methodology they need to decode opaque language models?
same paper, the framework this instantiates
Can we predict where language models will fail? Does characterizing the abstract computational problem an LLM solves—as a probability machine over sequences—let us predict which tasks it will struggle with systematically, before running experiments?
same paper, the computational-level companion
Can we understand LLM mechanisms with only representational analysis? Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.
same paper, implementation-level companion
Can we decode what LLM activations really represent in language? Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
adjacent: another approach to surfacing concealed representations

Related papers in this collection 8

Papers most semantically related to this note, ranked by cosine similarity in the embedding space.

Original note title

psychology methods like the Implicit Association Test bypass alignment-trained verbal cautions and reveal LLMs' underlying associations

Can indirect psychology tests reveal what LLMs conceal about bias?

Inquiring lines that read this note 6

Related concepts in this collection 4

Related papers in this collection 8

Search by related questions 4