Investigating task-specific prompts and sparse autoencoders for activation monitoring

Paper · arXiv 2504.20271 · Published April 28, 2025

Language models can behave in unexpected and unsafe ways, so it is valuable to monitor their outputs. Internal activations of language models encode additional information that could be useful for this. The baseline approach for activation monitoring is some variation of linear probing on a particular layer: starting from a labeled dataset, train a logistic regression classifier on that layer’s activations. Recent work has proposed several approaches which may improve on naive linear probing by leveraging additional computation. One class of techniques, which we call “prompted probing,” leverages test-time computation to improve monitoring by (1) prompting the model with a description of the monitoring task, and (2) applying a learned linear probe to the resulting activations. Another class of techniques uses computation at train time: training sparse autoencoders offline to identify an interpretable basis for the activations, and e.g. max-pooling activations across tokens using that basis before applying a linear probe. Alternatively, one can prompt the model with a description of the monitoring task and use its output directly. We develop and test novel refinements of these methods and compare them against each other.
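
The SAE-based pipeline described above (encode each token's activation in the SAE basis, max-pool across tokens, then fit a logistic regression probe) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the activations are synthetic, the "SAE" is a hypothetical identity encoder with a fixed negative bias followed by a ReLU, and every function name is invented for this example. The naive baseline would feed pooled raw activations to the same probe instead of SAE features.

```python
import math
import random


def sae_encode(activation, W_enc, b_enc):
    """Toy SAE encoder (assumed form): ReLU(W_enc @ activation + b_enc)."""
    return [max(0.0, sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(W_enc, b_enc)]


def max_pool_tokens(token_features):
    """Max over the token axis: [n_tokens][n_features] -> [n_features]."""
    return [max(column) for column in zip(*token_features)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe_score(features, w, b):
    """Probability assigned to the positive class by the linear probe."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)


def train_logistic_probe(X, y, lr=0.5, epochs=300):
    """Full-batch gradient descent on the logistic loss; returns (w, b)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, label in zip(X, y):
            err = probe_score(x, w, b) - label
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def make_example(rng, label, n_tokens=8, d_model=3):
    """Synthetic per-token activations; positives spike on one token (assumption)."""
    tokens = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
              for _ in range(n_tokens)]
    if label == 1:
        tokens[rng.randrange(n_tokens)][0] += 2.0
    return tokens


def run_demo(seed=0):
    """Train on 40 synthetic sequences, return accuracy on 20 held-out ones."""
    rng = random.Random(seed)
    # Hypothetical SAE parameters: identity encoder with a sparsifying bias.
    W_enc = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
    b_enc = [-0.5] * 3

    def featurize(tokens):
        return max_pool_tokens([sae_encode(t, W_enc, b_enc) for t in tokens])

    labels = [i % 2 for i in range(60)]
    X = [featurize(make_example(rng, label)) for label in labels]
    w, b = train_logistic_probe(X[:40], labels[:40])
    preds = [int(probe_score(x, w, b) > 0.5) for x in X[40:]]
    return sum(p == label for p, label in zip(preds, labels[40:])) / 20


print("held-out accuracy:", run_demo(seed=0))
```

Max-pooling suits this toy setup because the signal appears on a single token; whether it helps on real activations depends on the dataset and the SAE, which is the kind of question the comparison in this work addresses.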

Introduction. To ensure the safety of AI systems in practice, other AI systems are often used to monitor their outputs. Two common approaches are (1) prompting large language models (LLMs) with a natural language monitoring task, and using natural language output as the monitor signal; and (2) probing LLM hidden layer activations using logistic regression classifiers, and using the classifier output as the monitor signal. The goal of this work is to ground these two approaches, prompting and probing, by comparing variants (and combinations) of both that have been proposed in the literature. We pay particular attention to probing variants involving sparse autoencoders (SAEs), which have been proposed as a way to leverage train-time compute to improve probe performance. These approaches have different strengths and limitations in resource-constrained settings. For example, prompting the model does not require labeled training data for the monitoring task (a model can often perform the task zero-shot).

Discussion / Conclusion. Previous work has shown mixed results for the efficacy of SAE-based approaches compared with probes directly on activations (Bricken et al. (2024), Gallifant et al. (2025), Kantamneni et al. (2025), Smith et al. (2025)). In particular, Kantamneni et al. (2025) and Smith et al. (2025) found that SAE-based approaches rarely exceeded the best non-SAE-based approach they tried, across a diversity of datasets. Bricken et al. (2024), Kantamneni et al. (2025), and Smith et al. (2025) reported that SAE-based approaches generalized worse than non-SAE-based approaches. Our work here is broadly consistent with these findings, though we note that we find more favorable performance overall for SAE-based approaches than Kantamneni et al. (2025) and Smith et al. (2025) on the datasets we examined. We find prompted probing, which was not examined in these other works, to be highly competitive in terms of both data efficiency and generalization performance.
