INQUIRING LINE

Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · Model Architecture and Internals · cross-cluster

What makes passive prompt transfer fail as a substitute for auditable expertise?

This explores why dropping expertise into a prompt and hoping it transfers can't stand in for knowledge that's versioned, inspectable, and correctable — and what the corpus says breaks when you try.

This reads 'passive prompt transfer' as the move of packing expertise into a prompt and expecting it to carry over, and 'auditable expertise' as knowledge you can actually inspect, correct, and roll back. The corpus suggests three distinct failure points, and they compound. The first is a hard ceiling: prompting only works inside what a model already learned. Prompt optimization can reorganize and surface latent knowledge, but it cannot inject anything genuinely absent from training (see: Can prompt optimization teach models knowledge they lack?). So a prompt that 'contains' expertise the model never internalized is borrowing against an account that may be empty, and you won't see the overdraft until it matters.

The second failure is that even when the knowledge is there, passive prompting is unstable in a way that defeats auditing. Outputs swing with rephrasing, and that fragility tracks the model's underlying confidence: low-confidence answers flip under trivial prompt variation (see: Does model confidence predict robustness to prompt changes?). Expertise you can't reproduce reliably isn't expertise you can audit, because there's no stable artifact to point at and say 'this is what it knows.'
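One way to make that instability visible is to ask the same question in several phrasings and count how often the answer departs from the majority. The sketch below is a minimal illustration, not the method used in the cited note: `flip_rate`, `toy_model`, and the example paraphrases are all hypothetical names, and `toy_model` is a stand-in for a real model call.

```python
from collections import Counter
from typing import Callable, Sequence


def flip_rate(model: Callable[[str], str], paraphrases: Sequence[str]) -> float:
    """Fraction of paraphrases whose answer differs from the majority answer."""
    if not paraphrases:
        raise ValueError("need at least one paraphrase")
    # Normalize lightly so casing and whitespace do not count as a flip.
    answers = [model(p).strip().lower() for p in paraphrases]
    majority_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - majority_count / len(answers)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model: stable only when 'capital' appears."""
    return "Paris" if "capital" in prompt else "Lyon"


paraphrases = [
    "What is the capital of France?",
    "France's capital is?",
    "Which city governs France?",
]
rate = flip_rate(toy_model, paraphrases)
```

A rate near zero suggests a stable answer worth auditing; a high rate suggests there is no single answer to audit yet.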

The third, and most direct answer to the question, is governance. Real auditability needs a file-level lifecycle (versioning, inspection, correction, rollback) rather than expertise smuggled in as hidden prompt state. The COLLEAGUE.SKILL framing makes this explicit by separating what someone knows from how they behave, so each can be reviewed independently (see: Can person-grounded skills remain auditable without hidden prompt state?). A prompt blob is the opposite: opaque, unversioned, and entangled. The same logic shows up in how reasoning data is reused. The reusable unit isn't a prompt-response pair but a whole feedback interface bound to its verifier, lineage, and scaffold; attribution only becomes tractable when those are released together (see: What is the actual reusable unit of reasoning data?).
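The lifecycle described above can be sketched as a small versioned store. This is an illustration under stated assumptions, not the COLLEAGUE.SKILL format: the `SkillFile` class, its method names, and the example contents are hypothetical, and a real system would more likely sit on top of an existing version-control tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SkillFile:
    """Hypothetical versioned skill artifact (illustrative only)."""
    name: str
    history: List[str] = field(default_factory=list)

    def commit(self, content: str) -> int:
        """Store a new revision and return its 0-based version number."""
        self.history.append(content)
        return len(self.history) - 1

    def inspect(self, version: int = -1) -> str:
        """Return the content of a revision; defaults to the latest."""
        return self.history[version]

    def rollback(self, version: int) -> int:
        """Re-commit an earlier revision as the newest, keeping the audit trail."""
        if not 0 <= version < len(self.history):
            raise IndexError(f"no such version: {version}")
        return self.commit(self.history[version])


# Separate tracks so knowledge and behavior can be reviewed independently.
capability = SkillFile("capability")
behavior = SkillFile("behavior")

v0 = capability.commit("Knows the billing schema as of the first review.")
v1 = capability.commit("Knows the billing schema, with an unverified claim.")
v2 = capability.rollback(v0)  # correction: restore the reviewed revision
behavior.commit("Asks for confirmation before destructive actions.")
```

Rolling back appends rather than deletes, so the faulty revision stays inspectable, which is the property a prompt blob lacks.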

There's a cross-domain echo worth noticing: in agents, passive compression of context loses to actively delegating and verifying. Training a model to dispatch subtasks and integrate checked results teaches a transferable discipline of decomposition and evidence-grounding that passive summarization never builds (see: Can delegation teach models to manage context more actively?). And reliability in long reasoning traces comes from verifying the process by checking intermediate states, not from trusting a final output (see: Where do reasoning agents actually fail during long traces?). Both point the same way: trust is earned by something you can check along the way, not by a payload you hand over and hope holds.
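Checking along the way can be sketched as a loop that validates each intermediate step before accepting it, instead of scoring only the final output. The function `run_with_checks`, the `no_destructive_action` policy, and the example trace are hypothetical and are not drawn from the cited notes.

```python
from typing import Callable, Iterable, List, Tuple

Step = str
Check = Callable[[Step], bool]


def run_with_checks(steps: Iterable[Step], check: Check) -> Tuple[bool, List[Step]]:
    """Accept steps one at a time; stop at the first step that fails the check."""
    accepted: List[Step] = []
    for step in steps:
        if not check(step):
            # Process violation: halt here and report what was accepted so far.
            return False, accepted
        accepted.append(step)
    return True, accepted


def no_destructive_action(step: Step) -> bool:
    """Hypothetical policy: reject any step that mentions deleting."""
    return "delete" not in step.lower()


trace = ["read file", "delete file", "write summary"]
ok, accepted = run_with_checks(trace, no_destructive_action)
```

The violation is caught at the second step, before the later steps run, and the accepted prefix doubles as an audit record of what was actually verified.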

The thing you didn't know you wanted to know: 'auditable' and 'transferable' turn out to be the same property viewed from two angles. Expertise becomes portable precisely when it's been externalized into inspectable, verifier-bearing artifacts — and a prompt fails as a substitute not because it's too short, but because it hides the very structure that auditing and reliable transfer both depend on.

Sources 6 notes

Can prompt optimization teach models knowledge they lack?

Prompting works entirely within a model's pre-existing training distribution and cannot supply domain knowledge absent from training data. This creates a hard ceiling: no prompt strategy can compensate for missing foundational knowledge, only reorganize what already exists.

Does model confidence predict robustness to prompt changes?

ProSA found that when models are highly confident, they resist prompt rephrasing; low confidence causes major output swings. Larger models, few-shot examples, and objective tasks all correlate with higher confidence and greater robustness.

Can person-grounded skills remain auditable without hidden prompt state?

COLLEAGUE.SKILL treats distilled expertise as versioned files subject to inspection, correction, and rollback—not hidden prompt state. Separating capability tracks from behavior tracks enables independent audit of what someone knows versus how they act.

What is the actual reusable unit of reasoning data?

The reusable unit in post-training is a feedback interface entangled with six factors: verifier, base model, lineage, optimizer, scaffold, and budget. Changing any one alters the same data's effect, making attribution tractable only when these are jointly released.

Can delegation teach models to manage context more actively?

SearchSwarm shows that training models to delegate subtasks and integrate summarized results beats passive compression, with a 30B model matching much larger ones. Critically, the delegation skill transfers to single-agent tasks, suggesting it teaches disciplined decomposition and evidence grounding, not just orchestration.

Where do reasoning agents actually fail during long traces?

Reliability for long-trace reasoning comes from checking intermediate states and policy compliance during generation, not from scoring final outputs. Adding intermediate verification raised task success from 32% to 87% because most failures are process violations, not wrong answers.
