INQUIRING LINE


The depth of what an AI 'knows' internally isn't designed in — it's a footprint of its training diet.

Does latent density emerge during pretraining from training data familiarity?

This explores whether the dense vs. sparse activation patterns inside a model are something it *learns* during pretraining based on how familiar the data is — rather than being a fixed property of the architecture.

The corpus answers this fairly directly: yes. The clearest evidence is that neural networks develop dense activations for data they have seen a lot of and fall back to sparse representations for unfamiliar inputs, and this split emerges purely from pretraining exposure, before any task-specific fine-tuning (see "Is representational sparsity learned or intrinsic to neural networks?"). Density isn't baked into the network; it's a trace of what the model got comfortable with.
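To make "dense" versus "sparse" concrete, here is a minimal sketch of two ways one might score a single activation vector. The notes do not say which metric the underlying analysis used, so both measures (Hoyer sparsity and a thresholded active fraction), the function names, and the toy vectors are assumptions chosen for illustration only.

```python
import math

def hoyer_sparsity(activations):
    """Hoyer sparsity in [0, 1]: 0 for a uniform vector, 1 for a one-hot vector."""
    n = len(activations)
    if n < 2:
        raise ValueError("need at least two activations")
    l1 = sum(abs(a) for a in activations)
    l2 = math.sqrt(sum(a * a for a in activations))
    if l2 == 0.0:
        # Convention chosen for this sketch: an all-zero vector counts as maximally sparse.
        return 1.0
    return (math.sqrt(n) - l1 / l2) / (math.sqrt(n) - 1)

def active_fraction(activations, eps=1e-6):
    """Share of units whose magnitude exceeds eps (a crude density measure)."""
    return sum(1 for a in activations if abs(a) > eps) / len(activations)

# Toy vectors invented for illustration; these are not real model activations.
dense_like = [0.9, 1.1, 1.0, 0.8, 1.2, 1.0, 0.9, 1.1]
sparse_like = [0.0, 0.0, 3.0, 0.0, 0.0, 0.0, 0.0, 0.0]

if __name__ == "__main__":
    for name, vec in [("dense_like", dense_like), ("sparse_like", sparse_like)]:
        print(name, round(hoyer_sparsity(vec), 3), active_fraction(vec))
```

Under the familiarity claim, one would expect vectors for heavily seen inputs to score closer to `dense_like` and vectors for unfamiliar inputs closer to `sparse_like`, but that comparison would have to be run on real activations to mean anything.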

What makes this interesting is how many *other* behaviors turn out to be governed by the same familiarity logic. The strength of a concept's 'priming' after a few gradient updates is predictable from how probable its keywords were *before* learning, with a sharp threshold separating things that stick from things that don't; just three exposures can lock it in (see "Can we predict keyword priming before learning happens?"). Hallucination risk follows the same shape from the other side: models go wrong not when they're under-confident, but when they hit entity *combinations* they never saw co-occur in training, so pretraining co-occurrence statistics predict failure better than the model's own confidence does (see "Can pretraining data statistics detect hallucinations better than model confidence?"). Familiar territory → dense, confident, primed; unfamiliar territory → sparse, brittle, hallucination-prone. It's the same gradient seen through different instruments.
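The co-occurrence idea can be sketched in a few lines. This is not QuCo-RAG's implementation; it is a toy illustration, under the assumption that each training document has already been reduced to the set of entities it mentions. The function names, the `min_count` parameter, and the corpus are all hypothetical.

```python
from collections import Counter
from itertools import combinations

def build_cooccurrence(documents):
    """Count how often each unordered entity pair appears in the same document."""
    counts = Counter()
    for entities in documents:
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts

def unseen_pairs(query_entities, counts, min_count=1):
    """Return query entity pairs that co-occurred fewer than min_count times."""
    pairs = combinations(sorted(set(query_entities)), 2)
    return [p for p in pairs if counts[p] < min_count]

def should_retrieve(query_entities, counts, min_count=1):
    """Trigger retrieval on any unfamiliar pair, ignoring model confidence entirely."""
    return bool(unseen_pairs(query_entities, counts, min_count))

# Toy corpus invented for illustration: one entity list per document.
corpus = [
    ["Marie Curie", "radium", "Paris"],
    ["Marie Curie", "polonium"],
    ["Niels Bohr", "Copenhagen"],
]
counts = build_cooccurrence(corpus)

if __name__ == "__main__":
    print(should_retrieve(["Marie Curie", "radium"], counts))
    print(should_retrieve(["Marie Curie", "Copenhagen"], counts))
```

The design point is that the trigger reads only data-side statistics: a query pairing two individually familiar entities that never appeared together is flagged even if the model would answer it confidently.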

The deeper claim across these notes is that pretraining is where the durable structure gets *planted*, and later training only selects or nudges it. Cognitive biases, for instance, are causally traced to the pretrained backbone: models sharing a backbone show similar bias patterns regardless of what instruction data you fine-tune on, and fine-tuning only modulates them (see "Where do cognitive biases in language models come from?"). Reasoning ability tells the same story: five independent methods all *elicit* reasoning already latent in base-model activations rather than installing it, so the bottleneck is elicitation, not acquisition (see "Do base models already contain hidden reasoning ability?"). Even RL post-training mostly amplifies one format distribution that pretraining already favored while suppressing the alternatives (see "Does RL training collapse format diversity in pretrained models?"). Density-from-familiarity is one instance of a broader pattern: the model's character is laid down by exposure, and downstream training is a selector on top of it.

There's a useful counterpoint about *why* familiarity gets compressed into structure so readily. Latent-level prediction recovers compositional hierarchy exponentially faster than token-level prediction, because representations at the same level are far more correlated than raw tokens (see "Why is predicting latents more sample-efficient than tokens?"), which hints at why familiar data consolidates into dense, reusable internal structure rather than staying diffuse. And one architecture, Ouro, builds iterative computation directly into the latent space during pretraining, suggesting density isn't just an accident of exposure but something you can deliberately shape at the pretraining stage (see "Can reasoning be learned during pretraining rather than after?").

The thing you didn't know you wanted to know: this implies a practical asymmetry in how you'd intervene on a model. If density, biases, priming, and hallucination-proneness are all written during pretraining, then trying to fix them by fine-tuning is fighting the wrong layer. That is exactly why decoding-time approaches that leave base weights untouched preserve knowledge better than direct fine-tuning, which corrupts the knowledge stored in lower layers (see "Can decoding-time tuning preserve knowledge better than weight fine-tuning?").
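As a rough illustration of what a decoding-time intervention looks like, the sketch below shifts a base model's next-token logits by the difference between a tuned and an untuned small model, then renormalizes. This follows the general logit-offset idea behind proxy-tuning as I understand it; the function names and the three-token toy vocabulary are assumptions, and a real setup would read these logits from actual models sharing one tokenizer.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def shifted_distribution(base_logits, tuned_small_logits, untuned_small_logits):
    """Add the (tuned - untuned) offset of a small model pair to the base logits.

    The base model's weights are never modified; only its output distribution
    is adjusted at decoding time.
    """
    if not (len(base_logits) == len(tuned_small_logits) == len(untuned_small_logits)):
        raise ValueError("all logit vectors must cover the same vocabulary")
    shifted = [
        b + (t - u)
        for b, t, u in zip(base_logits, tuned_small_logits, untuned_small_logits)
    ]
    return softmax(shifted)

if __name__ == "__main__":
    # Toy logits over a three-token vocabulary, invented for illustration.
    base = [2.0, 1.0, 0.0]
    tuned_small = [0.0, 0.0, 3.0]
    untuned_small = [0.0, 0.0, 0.0]
    print(softmax(base))
    print(shifted_distribution(base, tuned_small, untuned_small))
```

When the tuned and untuned small models agree, the offset is zero and the base distribution passes through unchanged, which is one way to see why whatever the base weights store stays intact.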

Sources 9 notes

Is representational sparsity learned or intrinsic to neural networks?

During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.

Can we predict keyword priming before learning happens?

Pre-learning keyword probability strongly predicts post-learning priming across architectures and model sizes, with a ~10^-3 threshold separating contexts where priming occurs from those where it doesn't. Just 3 training exposures suffice to establish the effect.

Can pretraining data statistics detect hallucinations better than model confidence?

QuCo-RAG uses entity co-occurrence patterns from training data to trigger retrieval, successfully flagging hallucination risk even when models are highly confident. This data-side approach catches the root cause (unseen combinations) rather than the symptom (low confidence).

Where do cognitive biases in language models come from?

A causal experiment using random-seed variation and cross-tuning showed that models sharing a pretrained backbone exhibit similar bias patterns regardless of finetuning data. Biases are planted during pretraining and merely swayed by instruction tuning.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.


Does RL training collapse format diversity in pretrained models?

Controlled experiments show RL consistently amplifies one format distribution from pretraining within the first epoch while collapsing alternatives. The winning format depends on model scale, not necessarily performance, and is largely hidden when starting from proprietary pretrained models.

Why is predicting latents more sample-efficient than tokens?

A formal sample-complexity analysis proves latent-level self-supervision (data2vec/JEPA style) recovers compositional structure with samples constant in hierarchy depth, while token-level learning requires exponential samples—because same-level latents are far more correlated than raw tokens.

Can reasoning be learned during pretraining rather than after?

Ouro's 1.4B–2.6B models match 12B baselines by performing reasoning during pretraining via iterative latent loops, not by storing more knowledge. Their intermediate latent states align strongly with final outputs, making them more faithful than divergent chain-of-thought traces.

Can decoding-time tuning preserve knowledge better than weight fine-tuning?

Proxy-tuning closes 88-91% of the alignment gap while surpassing direct fine-tuning on knowledge tasks by leaving base model weights untouched. Direct fine-tuning corrupts knowledge storage in lower layers, whereas proxy-tuning applies distributional shifts that primarily affect reasoning and style.

Papers this line draws on 8

The research behind the notes this line reads — ranked by how closely each paper relates.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a research analyst tasked with re-testing constraints on latent density emergence in LLMs. The question remains open: does latent density emerge during pretraining from training data familiarity, and if so, what downstream interventions does that imply?

What a curated library found — and when (dated claims, not current truth):
Findings span 2024–2026; treat as perishable snapshots:
- Dense activations emerge for high-exposure data; sparse for unfamiliar inputs, purely from pretraining (2026 analysis).
- Knowledge priming after gradient updates is predictable from pretraining keyword probability with sharp thresholds; three exposures lock concepts in (~2025).
- Hallucination risk correlates with unobserved entity co-occurrences in training, not model confidence (2024–2025).
- Cognitive biases and reasoning ability are latent in pretrained backbones; fine-tuning selects rather than installs them (2025–2026).
- RL post-training amplifies one pretraining distribution while suppressing alternatives; decoding-time intervention preserves base-layer knowledge better than direct fine-tuning (2025).

Anchor papers (verify; mind their dates):
- arXiv:2605.27734 (2026-05): Latent-level prediction is exponentially more sample-efficient than token prediction.
- arXiv:2507.07186 (2025-07): Cognitive biases traced to pretraining backbone, invariant to fine-tuning.
- arXiv:2504.07912 (2025-04): RL post-training amplifies pretraining distributions.
- arXiv:2603.03415 (2026-03): OOD representation sparsity.

Your task:
(1) RE-TEST EACH CONSTRAINT. For every claim above, determine whether newer models (GPT-4.5+, o1/o3 variants), architectural innovations (MoE scaling, looped iterative latents), training methods (curriculum, multi-task pretraining), or runtime tooling (adaptive computation, speculative sampling) have relaxed or overturned it. Separate the durable principle (familiarity → consolidation?) from the perishable mechanism (density as sparsity measure). Where a constraint still holds, cite evidence; where it has shifted, name what shifted it.
(2) Surface the strongest contradicting or superseding work from the last ~6 months—especially any that show fine-tuning or RL successfully *remolds* pretrained structure, or that decouple density from familiarity.
(3) Propose 2 research questions that assume the regime may have evolved: e.g., "Can curriculum pretraining deliberately *structure* density orthogonal to data frequency?" or "Do multi-modal or multi-lingual pretrains break the familiarity→density link?"

Cite arXiv IDs; flag anything you cannot ground in a real paper.
