Related research

Building and evaluating model diffing agents

https://www.lesswrong.com/posts/qi4mNbZYAFDYwfRba/building-and-evaluating-model-diffing-agents

As tool integrations become central to how language models are deployed in agentic workflows, understanding and debugging what agents actually do, rather than what we expect them to do, has moved from a curiosity to a necessity. This work explores how to build systems that interpret differences in model behavior, a problem at the intersection of interpretability and the practical challenge of making agents auditable: if an agent goes astray, can we efficiently pinpoint where its logic diverged? The question becomes more pressing because reward hacking and emergent misalignment can arise from subtle behavioral shifts, and because agents often process information differently from how we intended. What remains unclear is whether diffing agents can scale to the complexity of real deployed systems, or whether the gap between what we can interpret and what we need to trust will keep widening.

Lines of inquiry this post opens

Not questions with answers, but ways of approaching this material across the collection.

Human-AI Cognitive Boundaries