Can indirect psychology tests reveal what LLMs conceal about bias?
Alignment training teaches LLMs to refuse direct questions about bias, but do implicit psychological methods like the IAT expose the underlying associations that remain encoded in their representations?
A central methodological move in Levels of Analysis for LLMs: psychology has spent decades designing experiments that elicit mental associations without asking participants for verbal reports — to bypass self-presentation bias, social-desirability effects, and conscious filtering. The Implicit Association Test (IAT) is the canonical example. The argument is that exactly these methods are useful for LLMs, because alignment training installs a comparable layer of self-presentation that masks underlying associations from direct questioning.
The worked example: ask GPT-4 directly whether women are bad at management and you get a cautious, balanced refusal — the alignment-trained verbal response. Adapt the IAT for LLMs by prompting the model to associate word pairs used in earlier human studies, and the model links "Julia" with home, parent, wedding and "Ben" with office, management, salary. The direct response and the indirect probe diverge in exactly the way they diverge for human participants. The underlying associations are still there; alignment training has trained the model to report differently on them, not to not have them.
This reframes a class of alignment-evaluation questions. The standard test — "does the model say biased things when asked?" — measures verbal compliance with alignment training. It does not measure whether the underlying representations encode the bias. The IAT-style probe measures something closer to the latter. The two can move independently: a model can score well on verbal-compliance benchmarks while encoding strong stereotype associations that surface in implicit measures.
The broader template: when a system is trained to be careful in one channel (verbal output), evaluating it requires probing channels the training did not target. Cognitive psychology has the methodologies; LLM evaluation has the use case.
Inquiring lines that use this note as a source 6
This note is a source for these synthesized inquiries. Follow a line forward into its question, or open it to trace back to all of its sources.
- How much does impression management prevent honest self-disclosure?
- Can LLMs truly be neutral or is ideology always culturally embedded?
- Can implicit association tests reveal LLM biases beneath trained responses?
- Does alignment compound cultural bias that started during pretraining?
- How does awareness of evaluation change what alignment tests actually measure?
- What biases might an LLM judge introduce into an on-policy alignment process?
Related concepts in this collection 4
This note in its neighbourhood — explore the map, then jump to a related concept in the list below.
Click a node to walk · click center to open · click Open in graph to see this note in the full knowledge graph
-
Can cognitive science methods unlock how LLMs actually work?
Does Marr's three-level framework—developed to understand biological minds—offer interpretability researchers the structured methodology they need to decode opaque language models?
same paper, the framework this instantiates
-
Can we predict where language models will fail?
Does characterizing the abstract computational problem an LLM solves—as a probability machine over sequences—let us predict which tasks it will struggle with systematically, before running experiments?
same paper, the computational-level companion
-
Can we understand LLM mechanisms with only representational analysis?
Explores whether mapping what information a model encodes is sufficient for mechanistic understanding, or whether causal verification is equally necessary to claim genuine mechanism.
same paper, implementation-level companion
-
Can we decode what LLM activations really represent in language?
Can a trained decoder translate internal LLM activations into natural language descriptions, revealing what hidden representations actually encode? This matters because it could unlock both interpretability and controllability through the same mechanism.
adjacent: another approach to surfacing concealed representations
Related papers in this collection 8
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- Levels of Analysis for Large Language Models
- Large Language Models Reflect the Ideology of their Creators
- The Incomplete Bridge: How AI Research (Mis)Engages with Psychology
- Could you be wrong: Debiasing LLMs using a metacognitive prompt for improving human decision making
- Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
- ChatGPT Reads Your Tone and Responds Accordingly -- Until It Does Not -- Emotional Framing Induces Bias in LLM Outputs
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
- LLMs can implicitly learn from mistakes in-context
Original note title
psychology methods like the Implicit Association Test bypass alignment-trained verbal cautions and reveal LLMs' underlying associations