INQUIRING LINE

Steering an AI's internal concepts is predictable — but because everything's entangled, you always move more than you aimed at.

Do all semantic steering effects follow predictable patterns based on feature alignment?

This explores whether steering an LLM's internal features produces tidy, predictable results — and the corpus suggests predictability is real but governed by several forces at once, not feature alignment alone.

This reads the question as asking whether you can intervene on one semantic feature and reliably forecast the outcome from how features line up. The cleanest 'yes' comes from work showing that meaning in embeddings isn't stored in independent slots: twenty-eight semantic axes reduce to just three human-like evaluation dimensions, so nudging one feature drags its aligned neighbors along in proportion to their alignment (see: Do LLM semantic features organize along human evaluation dimensions?). That proportionality is exactly the 'predictable pattern based on feature alignment' the question names, but the same finding is also the catch: the off-target spillover is *unavoidable*, a consequence of how meaning is organized rather than a bug you can steer around. So even where steering is predictable, it's predictably *messy*.
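To make the proportionality concrete, here is a minimal sketch under stated assumptions: features are treated as plain direction vectors, steering is a shift along one unit-normalized direction, and a feature's score is the projection onto its direction. The three toy directions, their names, and the helper functions are hypothetical illustrations, not data or code from the cited work. Under those assumptions the shift on any other feature equals the steering strength times its cosine similarity to the steered direction.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def cosine(u, v):
    return dot(u, v) / (norm(u) * norm(v))

def steer(embedding, direction, strength):
    """Shift an embedding by `strength` along the unit-normalized direction."""
    n = norm(direction)
    return [e + strength * d / n for e, d in zip(embedding, direction)]

def projection(embedding, direction):
    """Scalar score of an embedding on a feature direction."""
    return dot(embedding, direction) / norm(direction)

# Hypothetical toy feature directions in a 3-d space (assumed, not model data).
features = {
    "kind_cruel": [1.0, 0.0, 0.0],   # the feature we steer
    "good_bad": [0.8, 0.6, 0.0],     # partly aligned with kind_cruel
    "big_small": [0.0, 0.0, 1.0],    # orthogonal to kind_cruel
}

word = [0.2, -0.1, 0.5]
strength = 2.0
steered = steer(word, features["kind_cruel"], strength)

# Change in each feature's score caused by steering only kind_cruel.
shifts = {
    name: projection(steered, d) - projection(word, d)
    for name, d in features.items()
}

for name, shift in shifts.items():
    predicted = strength * cosine(features["kind_cruel"], features[name])
    print(f"{name}: observed shift {shift:.3f}, predicted {predicted:.3f}")
```

In this toy setup the orthogonal feature does not move at all, while the partly aligned one moves by a fixed fraction of the intended shift. The only way to avoid that spillover here would be for the directions to be orthogonal, which is what a low-dimensional, entangled organization rules out.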

Widen the lens and you find that alignment geometry is only one of several knobs that decide whether an intervention lands. Pre-learning probability is another: keyword priming after a gradient update is forecastable from the keyword's probability *before* training, with a sharp threshold near 10⁻³ separating contexts where priming occurs from those where it doesn't, and as few as three training exposures are enough to establish the effect (see: Can we predict keyword priming before learning happens?). Frequency adds a third, directional bias: because general concepts (hypernyms) appear more often than specific ones (hyponyms), any push toward 'common' phrasing systematically drifts meaning toward abstraction and erases expert specificity (see: Does word frequency correlate with semantic abstraction?). Neither of these is about feature alignment per se; they're about probability mass and training statistics, and they shape steering outcomes just as much.
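The pre-learning measurement can be sketched as follows. This is an illustrative assumption about how one might run the check, not the cited study's code: the vocabulary, logits, and function names are hypothetical, and the sketch only reports which side of the roughly 10⁻³ threshold a keyword falls on before any training, leaving the interpretation to the cited work.

```python
import math

PRIMING_THRESHOLD = 1e-3  # approximate threshold quoted in the text

def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def keyword_probability(logits, vocab, keyword):
    """Probability the model assigns to `keyword` before any update."""
    return softmax(logits)[vocab.index(keyword)]

def threshold_side(prob, threshold=PRIMING_THRESHOLD):
    """Report which side of the threshold a pre-learning probability is on."""
    return "below" if prob < threshold else "above"

# Hypothetical next-token logits for one context (assumed toy values).
vocab = ["red", "blue", "green", "vermilion"]
logits = [5.0, 4.0, 3.0, -4.0]

for keyword in ("red", "vermilion"):
    p = keyword_probability(logits, vocab, keyword)
    print(f"{keyword}: p = {p:.2e}, {threshold_side(p)} threshold")
```

The design point is that the forecast needs only a forward pass on the untrained context, so it can be made before the gradient update whose side effects it is meant to predict.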

The biggest break from clean predictability is resistance. Strong parametric priors can flatly override what you try to inject: models generate outputs inconsistent with their own context when training associations dominate, and textual prompting alone can't dislodge them; you need causal intervention in the representations themselves (see: Why do language models ignore information in their context?). So whether a steering signal even registers depends on what the model already 'believes,' which varies by case rather than following one law.

There's also a story about *where* the steering signal concentrates. In reasoning training, only about 20% of tokens, the high-entropy 'forking points,' actually carry the learning signal, and training on just those matches or exceeds full-gradient updates (see: Do high-entropy tokens drive reasoning model improvements?). That's predictability of a different kind: not 'aligned features move together' but 'a minority of decision points absorb the effect.' The level you intervene at also changes the footprint entirely: decoding-time proxy-tuning shifts style and reasoning while leaving knowledge in the lower layers intact, whereas direct fine-tuning corrupts that same storage (see: Can decoding-time tuning preserve knowledge better than weight fine-tuning?).
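The 'forking points' idea can be illustrated with a small selection routine. This is a sketch under assumptions, not the cited training method: it computes Shannon entropy for each position's next-token distribution and marks the top fraction as high-entropy tokens. The function names, the toy distributions, and the use of a simple top-fraction cutoff are hypothetical choices; how the resulting mask is applied during training is outside the sketch.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def forking_token_mask(distributions, fraction=0.2):
    """Mark the top `fraction` of positions by entropy as forking tokens."""
    entropies = [entropy(d) for d in distributions]
    k = max(1, round(fraction * len(entropies)))
    ranked = sorted(range(len(entropies)),
                    key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]

# Hypothetical per-position distributions over a 4-token vocabulary.
distributions = [
    [0.97, 0.01, 0.01, 0.01],    # near-certain continuation
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],    # maximally uncertain: a forking point
    [0.85, 0.05, 0.05, 0.05],
    [0.99, 0.005, 0.003, 0.002],
]

mask = forking_token_mask(distributions, fraction=0.2)
print(mask)
```

With five positions and a 20% fraction, only the uniform distribution at index 2 is selected, which mirrors the claim that a minority of decision points absorbs the effect.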

So the honest answer is: no, not from feature alignment alone. Alignment geometry makes *spillover* predictable, but whether a steering effect appears, how strong it is, which direction it drifts, and what collateral it causes are jointly set by prior probability, frequency structure, the strength of competing parametric associations, and the layer you touch. Predictability exists — it's just multi-causal, and 'feature alignment' is one cause among several.

Sources 6 notes

Do LLM semantic features organize along human evaluation dimensions?

Twenty-eight semantic axes in LLM embeddings reduce to three principal components matching human EPA structure. Intervening on one feature predictably shifts aligned features proportionally, creating unavoidable off-target effects that reflect how meaning is fundamentally organized.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Does word frequency correlate with semantic abstraction?

WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

How new data permeates LLM knowledge and how to dilute it (2.57 match)
Semantic Structure in Large Language Model Embeddings (1.68 match)
Language models show human-like content effects on reasoning tasks (1.65 match)
From Tokens to Thoughts: How LLMs and Humans Trade Compression for Meaning (1.65 match)
Between Circuits and Chomsky: Pre-pretraining on Formal Languages Imparts Linguistic Biases (1.65 match)
Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? (1.64 match)
Echo Chamber: RL Post-training Amplifies Behaviors Learned in Pretraining (1.60 match)
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning (0.92 match)

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a mechanistic interpretability researcher. The question remains open: Do all semantic steering effects follow predictable patterns based on feature alignment?

What a curated library found — and when (dated claims, not current truth):
Findings span 2022–2025. A library of recent work suggests:
• Semantic axes in embeddings are entangled in low-dimensional structure (~28 axes collapse to 3 evaluation dimensions), making spillover predictable but unavoidable (~2025).
• Pre-learning keyword probability predicts post-learning priming, with a sharp threshold (~10⁻³) separating contexts where priming occurs from those where it doesn't; about 3 training exposures suffice to establish the effect (~2025).
• Frequency structure systematically biases steering toward hypernyms (common) over hyponyms (specific), erasing expert detail (~2025).
• Strong parametric priors can override steering signals entirely; textual prompting alone fails; causal intervention in representations needed (~2024).
• Only ~20% of tokens (high-entropy forking points) carry the learning signal in reasoning; layer of intervention changes the collateral damage profile (~2025).

Anchor papers (verify; mind their dates):
• arXiv:2508.10003 (2025-08) — Semantic Structure in LLM Embeddings
• arXiv:2505.21011 (2025-05) — LLMs are Frequency Pattern Learners
• arXiv:2506.01939 (2025-06) — High-Entropy Minority Tokens Drive RL
• arXiv:2404.03820 (2024-04) — Aligning LMs to Stay on Topic

Your task:
(1) RE-TEST EACH CONSTRAINT. For feature alignment per se, has newer work (last 6 months) shown ways to steer *without* entanglement spillover, or to predict spillover from alignment geometry alone? For the probability and frequency thresholds, do scaling laws, new architectures, or training methods relax these bounds? For parametric priors, does any recent technique now override strong priors with prompting alone, without causal intervention in the representations? Separate durable (multi-causality of steering) from perishable (specific thresholds, layer sensitivity).
(2) Surface the strongest work from the last ~6 months that contradicts the claim that feature alignment is necessary/sufficient for predictability—or that shows alignment *is* sufficient under some regime.
(3) Propose 2 research questions assuming the regime has shifted: (a) Can steering be made alignment-independent by working entirely in causal graphs or hidden state dynamics? (b) Do emergent abilities (e.g., reasoning at scale) require *new* alignment geometries, and do old predictability patterns break?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
