INQUIRING LINE


Can you audit a black-box AI's hidden knowledge just by feeding it noise and watching where it settles?

What makes attractor-based probing better for third-party model auditing than alternatives?

This explores why probing a model by feeding it noise and watching where its encode-decode loop settles (attractors) is well-suited to auditing someone else's model — when you have no training data, no inputs, and no internals to inspect.

Attractor-based probing fits the third-party auditor's predicament specifically: you're handed a model you didn't build, can't see inside, and whose training data you can't access. The defining move in "Can we probe foundation models without any input data?" is to start from random noise and iterate the model's own encode-decode map until it settles into attractors: stable states that act as a dictionary of what the model internalized. No training set, no curated probe inputs, no weights to crack open. For an auditor, that's the whole game, because the richer alternatives quietly assume access you don't have.
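The noise-to-attractor loop described above can be sketched in a few lines. This is a minimal toy under stated assumptions, not the source's method: `toy_encode_decode` is a hypothetical stand-in (a softmax-weighted mix of stored prototypes) for whatever encode-decode map a real model exposes, and every name and parameter here (`beta`, `tol`, `merge_tol`, `n_starts`) is illustrative.

```python
import numpy as np

def toy_encode_decode(x, prototypes, beta=8.0):
    """Hypothetical stand-in for a model's encode-decode map (an assumption).

    Encode: softmax similarity of x to each stored prototype.
    Decode: similarity-weighted mix of the prototypes.
    """
    scores = beta * (prototypes @ x)
    weights = np.exp(scores - scores.max())  # subtract max for numerical stability
    weights /= weights.sum()
    return weights @ prototypes

def find_attractor(step, x0, tol=1e-8, max_iters=500):
    """Iterate x <- step(x) until successive iterates stop moving."""
    x = x0
    for _ in range(max_iters):
        x_next = step(x)
        if np.linalg.norm(x_next - x) < tol:
            return x_next
        x = x_next
    return x  # may not have converged; callers should check the residual

def probe_from_noise(step, dim, n_starts=50, seed=0, merge_tol=1e-3):
    """Collect the distinct settling points reached from random-noise starts."""
    rng = np.random.default_rng(seed)
    attractors = []
    for _ in range(n_starts):
        x_star = find_attractor(step, rng.standard_normal(dim))
        # Keep x_star only if it is not a near-duplicate of one already found.
        if all(np.linalg.norm(x_star - a) >= merge_tol for a in attractors):
            attractors.append(x_star)
    return attractors

# Toy "internalized knowledge": three stored patterns in a 4-d space.
prototypes = np.eye(4)[:3]
attractors = probe_from_noise(lambda x: toy_encode_decode(x, prototypes), dim=4)
for a in attractors:
    print(np.round(a, 3))
```

In this toy, the recovered settling points sit close to the stored patterns, which is the sense in which attractors act as a "dictionary". The probe only ever calls `step`, so it needs no training data, no curated inputs, and no view of internals. Whether a real closed model exposes a usable encode-decode loop is a separate question that the sketch does not address.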

The reason this matters becomes clear when you look at what the standard alternatives miss. Surface evaluation (run benchmarks, read the metrics) can be actively misleading. "Can models be smart without organized internal structure?" shows that two models can post identical scores while one carries a cleanly organized representation and the other a fractured one that shatters under perturbation or distribution shift. An auditor trusting the leaderboard would certify both as equivalent. Attractors probe the organization itself, not the score the organization happens to produce, so they can surface the rot that metrics paper over.
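A toy illustration of how equal scores can hide unequal robustness, assuming a deliberately simple setup: two hand-written classifiers that both score perfectly on clean data, one relying on a wide-margin feature and one on a narrow-margin feature. This is a plain input-perturbation sweep, not attractor probing and not the cited note's analysis; the data, the models, and the noise scales are all invented for the example.

```python
import numpy as np

def accuracy(predict, X, y):
    """Fraction of examples where the prediction matches the label."""
    return float(np.mean(predict(X) == y))

def perturbation_sweep(predict, X, y, noise_scales, seed=0):
    """Accuracy under additive Gaussian input noise at each scale."""
    rng = np.random.default_rng(seed)
    return [accuracy(predict, X + s * rng.standard_normal(X.shape), y)
            for s in noise_scales]

# Toy data (an assumption): both features perfectly predict the label,
# but feature 0 has a wide margin (+/-1.0) and feature 1 a narrow one (+/-0.05).
y = np.arange(1000) % 2
signs = 2 * y - 1
X = np.column_stack([1.0 * signs, 0.05 * signs])

model_a = lambda X: (X[:, 0] > 0).astype(int)  # relies on the wide-margin feature
model_b = lambda X: (X[:, 1] > 0).astype(int)  # relies on the narrow-margin feature

scales = [0.0, 0.2]  # 0.0 is the clean benchmark; 0.2 is a mild perturbation
acc_a = perturbation_sweep(model_a, X, y, scales)
acc_b = perturbation_sweep(model_b, X, y, scales)
print("model A:", acc_a)
print("model B:", acc_b)
```

Both models tie on the clean benchmark, and only the perturbed run separates them. A leaderboard that reports the first number alone would treat them as equivalent.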

The deeper alternatives have their own access problem. "Can we understand LLM mechanisms with only representational analysis?" argues that real mechanistic claims need both representational analysis (locate the feature) and causal analysis (verify it by intervening). Causal verification means perturbing internals, which is exactly what a third party often can't do on a closed model. Attractor dynamics sidesteps this: it's a representational read-out that needs only the ability to run the model's own loop, not surgical access to its activations. It won't give you the full causal story, but it gives you a structural fingerprint when the richer methods are simply off the table.

There's a suggestive contrast worth pulling in. Detection methods like the steering-vector circuit in "How do language models detect injected steering vectors internally?" depend on injecting perturbations and watching the model's internal response. Tellingly, that work finds that safety training can suppress the very signal you're trying to read, with detection dropping from 63.8% to 10.8%. A probe that relies on the model cooperatively reporting its internal states is a probe a trained model can learn to hide from. Noise-driven attractors don't ask the model anything; they read where its dynamics naturally pool, which is harder to dress up for an auditor.

The honest caveat: this corpus has one direct source on attractor probing, and it's a vision-foundation-model result. The case for it as an *auditing* tool is a lateral synthesis: its black-box, data-free nature lines up against the documented blind spots of metric-based evaluation and the access requirements of causal and detection-based methods. If you want to go deeper on what attractors actually recover, start at "Can we probe foundation models without any input data?"; if you want to feel why auditors can't trust the easy alternatives, "Can models be smart without organized internal structure?" is the sharpest doorway.

Sources: 4 notes

Can we probe foundation models without any input data?

Vision foundation models can be probed by iterating encode-decode maps starting from random noise, producing attractors that function as a dictionary of internalized signals. This black-box method requires no access to training data or model inputs.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we understand LLM mechanisms with only representational analysis?

Representational analysis alone identifies correlations without causation; causal analysis alone shows behavioral effects without explaining them. Only paired methods—locating candidate features representationally, then verifying causally—produce complete mechanistic claims.

How do language models detect injected steering vectors internally?

Contrastive preference optimization trains evidence-carrier features in early layers to suppress gate features that default to denial, enabling near-perfect detection of internal perturbations. Safety training actively suppresses this capability, reducing detection from 63.8% to 10.8%.

Papers this line draws on: 8

The research behind the notes this line reads — ranked by how closely each paper relates.

AxBench: Steering LLMs? Even Simple Baselines Outperform Sparse Autoencoders (1.62 match)
Computational structuralism: Toward a formal theory of meaning in the age of digital intelligence (1.61 match)
Mechanisms of Introspective Awareness (0.90 match)
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models (0.84 match)
Levels of Analysis for Large Language Models (0.84 match)
Towards Principled Evaluations of Sparse Autoencoders for Interpretability and Control (0.84 match)
Emergent Introspective Awareness in Large Language Models (0.84 match)
Mechanistic Indicators of Understanding in Large Language Models (0.82 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a third-party model auditor tasked with evaluating whether attractor-based probing truly offers advantages over alternatives for black-box inspection. The question remains open: what structural insights does attractor probing yield that surface metrics, causal interventions, and detection methods cannot?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026 and touch probing, evaluation, and interpretability across modalities.

• Identical performance metrics can mask fundamentally different internal representations; one model may have cleanly organized representations while another's fracture under perturbation, yet both post identical scores (~2024).
• Mechanistic understanding requires both representational analysis and causal verification; causal methods need access to model internals for intervention, which third-party auditors typically lack (~2025).
• Safety training (DPO) can suppress internal signals that detection-based probes rely on, dropping detection accuracy from 63.8% to 10.8%; noise-driven attractors sidestep this by reading natural dynamical pooling rather than querying the model (~2025–2026).
• Attractor-based probing recovers a "dictionary" of what a model internalized by iterating its encode-decode map from random noise—no training set, no curated inputs, no weight access required (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2507.06952 "What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models" (2025-07)
• arXiv:2503.13401 "Levels of Analysis for Large Language Models" (2025-03)
• arXiv:2603.21396 "Mechanisms of Introspective Awareness" (2026-03)
• arXiv:2506.09677 "Reasoning Models Are More Easily Gaslighted Than You Think" (2025-06)

Your task:
(1) RE-TEST THE ACCESS CONSTRAINT. Has the emergence of newer prompt-based steering, ensemble probing, or in-context readout methods since mid-2025 narrowed the gap between what closed-box auditors can extract via queries and what attractor dynamics reveal? Separate the durable claim (black-box auditors lack causal access) from the perishable one (noise-driven attractors are the *only* viable read-out). Where does the constraint still hold?
(2) Surface the strongest CONTRADICTING work from the last ~9 months: does any recent paper argue that metric-level or query-based detection has caught up to—or superseded—attractor methods for spotting representation rot, safety-training artifacts, or deceptive alignment?
(3) Propose 2 research questions that assume the regime may have shifted: (a) Can language-model reasoning agents (as in 2509.21240, 2509.23808) use tree search or confidence bounds to *infer* attractor structure without iterating the model thousands of times? (b) Do reasoning models' explicit step outputs (cf. 2508.15260) reduce auditors' reliance on implicit attractors altogether?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
