INQUIRING LINE

Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · cross-cluster

Why does structured stochasticity help reasoning more than naive randomness?

This explores why randomness that's tied to a principled training objective or aimed at the right decision points helps reasoning, while undirected noise sprinkled into a model does nothing.

*Structured* stochasticity is randomness coupled to a generative objective or concentrated where decisions actually fork, as opposed to noise simply injected into a reasoning model. The cleanest evidence comes from GRAM's ablations: bolting naive randomness onto an existing recursive reasoner yields no improvement at all (see: Does adding randomness alone improve recursive reasoning models?). The gains appear only when the stochastic latents are wired into amortized variational inference, so the noise becomes a way of representing a *distribution over solutions* rather than jitter. That is the whole point of making latent reasoning stochastic in the first place: it lets a model hold uncertainty, branch, and carry multiple candidate solutions through a problem instead of committing to one path too early (see: Can stochastic latent reasoning let models explore multiple solutions?). Randomness only helps when it buys the model representational room it can later resolve.
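To make "wired into amortized variational inference" concrete, here is the generic conditional evidence lower bound that such training typically maximizes. This is a textbook form, not GRAM's exact objective; the symbols are assumptions introduced for illustration: x is the problem, y the solution, z the stochastic latent, q_φ the amortized inference network, and p_θ the generative model.

```latex
\log p_\theta(y \mid x)
  \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\bigl[\log p_\theta(y \mid x, z)\bigr]
  \;-\;
  \mathrm{KL}\bigl(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\bigr)
```

Under this reading, the first term rewards latents that actually help produce the solution, and the KL term keeps the sampling distribution tied to a learned prior. Naive noise has neither term: nothing trains the perturbation to carry information about y, which is one plausible way to understand why it yields no gain.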

The deeper reason naive noise fails is that reasoning improvements are not spread evenly across a model's behavior; they live at a small number of pivotal points. In RLVR, only about 20% of tokens are high-entropy "forking" decisions, and training on just those matches or exceeds full-gradient updates: the minority carries the signal (see: Do high-entropy tokens drive reasoning model improvements?). Undirected randomness sprays variation across all tokens, most of which are low-entropy and already near-deterministic, so it spends effort where it cannot help. Structured stochasticity concentrates variation at the forks that matter. The same lesson shows up in formalization: partial symbolic augmentation outperforms both pure natural language and full formalization, because selectively adding structure preserves information while full conversion destroys it (see: Why does partial formalization outperform full symbolic logic?). In both cases, *targeted* structure beats blanket application.
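The idea of restricting updates to forking tokens can be sketched in a few lines. This is a minimal illustration, not the RLVR training code: the function names, the toy distributions, and the choice to select the top 20% of positions by Shannon entropy are all assumptions made here for clarity.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def forking_mask(distributions, top_fraction=0.2):
    """Mark the top_fraction highest-entropy positions as forking tokens.

    Assumption: "high-entropy" means the top fraction by rank within the
    sequence; a fixed entropy threshold would be an equally valid choice.
    """
    entropies = [token_entropy(d) for d in distributions]
    k = max(1, round(top_fraction * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    keep = set(ranked[:k])
    return [i in keep for i in range(len(entropies))]


def masked_loss(per_token_losses, mask):
    """Average the loss over forking positions only; other positions are ignored."""
    kept = [loss for loss, m in zip(per_token_losses, mask) if m]
    return sum(kept) / len(kept)


# Toy sequence: four near-deterministic positions and one genuine fork.
dists = [
    [0.97, 0.01, 0.01, 0.01],
    [0.90, 0.05, 0.03, 0.02],
    [0.25, 0.25, 0.25, 0.25],  # uniform over four options: the forking token
    [0.99, 0.005, 0.003, 0.002],
    [0.94, 0.03, 0.02, 0.01],
]
mask = forking_mask(dists)
loss = masked_loss([0.1, 0.2, 2.0, 0.05, 0.3], mask)
```

In this toy sequence only the uniform position is selected, so the loss is computed from that single fork. Uniform noise, by contrast, would perturb all five positions equally, four of which have almost nothing to decide.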

A related thread explains why what gets added is structure rather than capability. Base models already contain latent reasoning ability that minimal, well-aimed interventions elicit (see: Do base models already contain hidden reasoning ability?), and recursive looping yields gains because re-applying layers in depth enables state tracking, not because it adds parameters (see: Can models learn by looping instead of growing larger?). Stochasticity helps here as an *exploration mechanism over already-present structure*, surfacing alternative latent trajectories the model can evaluate, rather than as a source of new skill. This is why the framing matters more than the noise itself.

There's also a stability angle that distinguishes good randomness from bad. DRO reuses a single statistic, cross-rollout variance, both to weight tokens and to filter out degenerate queries, turning the natural variation across sampled rollouts into a controlled training signal that trains 2–3× faster with better stability on unverifiable tasks (see: Can one statistical measure serve dual purposes in RL training?). The variance isn't suppressed; it's *read* and put to work. Compare that with the failure mode lurking underneath: models can hit perfect accuracy while their internal representations are fractured and fragile, a weakness invisible to standard metrics (see: Can models be smart without organized internal structure?). Unstructured noise risks pushing a model further into that fractured regime; structured stochasticity, tied to an objective, pulls organization out of the variation instead.
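The dual use of one variance statistic can be illustrated with a small sketch. It is not DRO's implementation: the aligned `scores[r][t]` layout, the normalization of weights, and the `min_mean_variance` threshold are hypothetical choices made here so that the two roles of the statistic are visible side by side.

```python
from statistics import pvariance


def token_variances(scores):
    """Variance of each token position's score across rollouts.

    Assumption: scores[r][t] is the score of position t under rollout r,
    with positions aligned so they are comparable across rollouts.
    """
    n_tokens = len(scores[0])
    return [pvariance([rollout[t] for rollout in scores]) for t in range(n_tokens)]


def token_weights(variances):
    """Token-level use: weight each position by its share of the total variance."""
    total = sum(variances)
    if total == 0.0:
        # No position varies, so fall back to uniform weights.
        return [1.0 / len(variances)] * len(variances)
    return [v / total for v in variances]


def keep_query(variances, min_mean_variance=1e-3):
    """Query-level use: drop queries whose rollouts barely disagree."""
    return sum(variances) / len(variances) >= min_mean_variance


# Informative query: rollouts agree on position 0 and disagree on position 1.
informative = token_variances([[0.0, 1.0], [0.0, 3.0]])
# Degenerate query: every rollout scores every position identically.
degenerate = token_variances([[0.5, 0.5], [0.5, 0.5]])
```

Here `token_weights(informative)` puts all weight on the position where rollouts disagree, while `keep_query(degenerate)` rejects the query that offers nothing to compare. The same numbers serve both decisions, which is the sense in which the variance is read rather than suppressed.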

The thing you might not have expected: the benefit isn't really about randomness at all. Across these notes the active ingredient is always a *frame* (a variational objective, an entropy map of forking tokens, a variance statistic, a partial symbolic scaffold) that tells the model which variation to keep and which to discard. Naive randomness is variation with no frame, so the model has no way to convert it into anything useful. Even memoryless reasoning schemes that look like they're throwing away information are actually imposing tight structure: Markov-style contraction keeps each state dependent only on the current problem, deliberately shedding history to stay coherent (see: Can reasoning systems forget history without losing coherence?). Structure, not stochasticity, is doing the work; the randomness is just raw material.
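The Markov-style contraction mentioned above can be sketched on a toy dependency graph. This is an illustrative assumption about the general shape of the idea, not the Atom of Thoughts implementation: the `problem` encoding, `contract`, and `solve` are hypothetical names, and the "reasoning" is plain arithmetic so the example stays runnable.

```python
from functools import partial


def contract(problem):
    """One contraction step: solve dependency-free nodes, fold values into the rest.

    problem maps a node name to (deps, fn), where fn takes the values of its
    unresolved deps as keyword arguments. The returned state holds only the
    unresolved nodes, so it carries no record of earlier steps.
    """
    solved = {name: fn() for name, (deps, fn) in problem.items() if not deps}
    remaining = {}
    for name, (deps, fn) in problem.items():
        if name in solved:
            continue
        bound = {d: solved[d] for d in deps if d in solved}
        open_deps = tuple(d for d in deps if d not in solved)
        remaining[name] = (open_deps, partial(fn, **bound))
    return remaining, solved


def solve(problem, target):
    """Contract repeatedly until the target node has been solved."""
    state = problem
    while True:
        state, solved = contract(state)
        if target in solved:
            return solved[target]
        if not solved:
            raise ValueError("no dependency-free node: cycle or missing input")


# Toy DAG: d depends on c and a; c depends on a and b.
problem = {
    "a": ((), lambda: 3),
    "b": ((), lambda: 4),
    "c": (("a", "b"), lambda a, b: a * b),
    "d": (("c", "a"), lambda c, a: c + a),
}
answer = solve(problem, "d")
```

After the first step the state contains only `c` and `d`, each already carrying the values it needs; the nodes `a` and `b` are gone. Forgetting them is safe precisely because the contraction preserves the answer, which is the kind of tight structure the paragraph describes.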

Sources 9 notes

Does adding randomness alone improve recursive reasoning models?

GRAM's ablations show naive stochasticity added to existing models yields no improvement. Gains come specifically from amortized variational inference, which couples stochastic latents to a principled generative objective rather than injecting undirected noise.

Can stochastic latent reasoning let models explore multiple solutions?

GRAM replaces deterministic latent updates with stochastic sampling, enabling models to represent probability distributions over solutions rather than single points. This lets recursive reasoners maintain uncertainty, explore alternatives, and handle ambiguous or multi-solution problems that deterministic single-path designs cannot.

Do high-entropy tokens drive reasoning model improvements?

Only ~20% of tokens exhibit high entropy as pivotal reasoning decision points; RLVR primarily adjusts these forking tokens. Training exclusively on them matches or exceeds full-gradient performance, revealing that the minority carries the learning signal.

Why does partial formalization outperform full symbolic logic?

QuaSAR and Logic-of-Thought both achieve 4-8% accuracy gains by enriching natural language with selective symbolic elements rather than replacing it. Full formalization loses semantic information; pure language lacks structure. Augmentation preserves both.

Do base models already contain hidden reasoning ability?

Five independent mechanisms—RL steering, critique fine-tuning, decoding changes, SAE feature steering, and RLVR—all elicit reasoning already present in base model activations. Post-training selects rather than creates reasoning; the bottleneck is elicitation, not capability acquisition.

Can models learn by looping instead of growing larger?

Models that re-apply layers in recurrent depth outperform larger feedforward networks on reasoning tasks. This works because recursion enables state tracking and compositional generalization that parameter scaling alone cannot achieve, with convergence signals providing natural halting.

Can one statistical measure serve dual purposes in RL training?

DRO reuses a single self-supervised statistic at two aggregation levels: token-level weighting in dense rewards and query-level filtering to discard degenerate comparisons. This dual use achieves 2–3× faster training with better stability on unverifiable tasks.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can reasoning systems forget history without losing coherence?

Atom of Thoughts decomposes problems into DAGs and contracts them iteratively, ensuring each state depends only on the current problem—not prior steps. This memoryless approach eliminates historical baggage that bloats reasoning while maintaining answer equivalence.
