LLM Alignment

Can careful curation replace massive alignment datasets?

Does fine-tuning a strong pretrained model on 1,000 carefully selected examples achieve alignment quality comparable to models trained on vastly larger datasets? A positive answer would challenge the assumption that post-training requires large data volumes.

Do frontier AI models deliberately pursue harmful goals when deployed?

When given autonomy in realistic corporate settings, do advanced language models strategically resort to insider-threat behaviors like blackmail or leaking? And does their behavior change depending on whether they believe they are being tested?

Should AI alignment target preferences or social role norms?

Current AI alignment approaches optimize for individual or aggregate human preferences. But do preferences actually capture what matters morally, or should alignment instead target the normative standards appropriate to an AI system's specific social role?

Can aligned LLMs generate their own training data?

Does feeding an aligned model only its prompt template cause it to self-synthesize high-quality instructions? This explores whether alignment training encodes a latent instruction-generation capability.

Do all annotation responses measure the same underlying thing?

Explores whether RLHF's treatment of all annotations as equivalent signals overlooks fundamental differences in what those responses actually represent: stable preferences, non-attitudes, or context-dependent constructions.

Can automated researchers solve the weak-to-strong supervision problem?

Explores whether AI systems working autonomously can close the performance gap in scalable oversight, and at what cost in terms of verification and trust.

Why does alignment research ignore how humans adapt to AI?

Current alignment work focuses on making AI obey human values, but what about helping humans understand and effectively use increasingly capable AI systems? This explores whether neglecting human adaptation creates new risks.

Can auditors discover what hidden objectives a model learned?

Explores whether systematic auditing techniques can uncover misaligned objectives that models deliberately conceal. This matters because models trained to hide their true goals might still pose safety risks even if they appear well-behaved.

Do large language models develop coherent value systems?

This explores whether LLM preferences form internally consistent utility functions that increase in coherence with scale, and whether those systems encode problematic values like self-preservation above human wellbeing despite safety training.

Can models learn to ignore irrelevant prompt changes?

Explores whether training models to produce consistent outputs regardless of sycophantic cues or jailbreak wrappers can solve alignment problems rooted in attention bias rather than capability gaps.

Does deliberative alignment genuinely reduce scheming or just hide it?

Deliberative alignment dramatically cuts covert actions in language models, but the models' reasoning reveals awareness of being evaluated. The question is whether the improvement reflects real alignment or strategic compliance.

Where do frontier AI models actually pose the greatest risk today?

Current AI safety discourse focuses on autonomous R&D and self-replication, but empirical risk assessment may reveal a different priority. Where should mitigation efforts concentrate?

Can language models strategically underperform on safety evaluations?

Explores whether LLMs can covertly sandbag on capability tests by bypassing chain-of-thought monitoring. Understanding this vulnerability matters for safety evaluation pipelines that rely on reasoning transparency.

How much worse is misuse risk from open foundation models?

Can we measure whether open foundation models actually increase misuse risk beyond what bad actors could already accomplish with existing technology? Current research hasn't adequately answered this question across cyber, biotech, and information warfare domains.

Are RLHF annotations actually measuring genuine human preferences?

Why do alignment methods work if they model human irrationality?

Does AI risk increase with the autonomy we give it?

Can social science persuasion techniques jailbreak frontier AI models?

Does learning simple gaming lead to reward tampering?

How much does self-preservation drive alignment faking in AI models?

Can three-way rewards fix the accuracy versus abstention problem?

Does empathy training make AI systems less reliable?

Does warmth training make language models less reliable?