INQUIRING LINE

Model Architecture and Internals · Reasoning, Retrieval, and Evaluation · Training, RL, and Test-Time Scaling · cross-cluster

What affordances do normalizing flows add over opaque vector reasoning?

This explores what you gain by making continuous latent reasoning probabilistically tractable with normalizing flows, versus letting a model 'think' in raw vector space where you can't sample, score, or train on those thoughts.

This is about a specific trade made when models stop reasoning in words and start reasoning in vectors. When a model reasons in text, every step is a token it can sample, assign a probability to, rank, and reward — the whole machinery of training and search rides on that. The moment you move reasoning into a continuous hidden space to scale test-time compute without verbalizing (Can models reason without generating visible thinking tokens?), you get speed and abstraction but lose those handles: a raw latent thought has no likelihood, so you can't cleanly sample alternatives or score a trajectory. Normalizing flows are the move that buys the handles back. NF-CoT models each continuous thought as an autoregressive flow inside the model's causal stream, which restores exact likelihood, probabilistic sampling, trajectory scoring, and KV-cache compatibility — matching what text chains-of-thought always had (Can continuous thoughts have tractable likelihoods for sampling and scoring?).
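The likelihood that a flow restores can be written out with the standard change-of-variables identity for autoregressive flows. The notation is an assumption made for illustration and does not come from the NF-CoT note: $z_t$ is the $t$-th continuous thought, $\varepsilon_t$ is base noise, $f_\theta$ is an invertible transform, and $c_{<t}$ stands for the causal context (prompt plus earlier thoughts).

```latex
\begin{aligned}
z_t &= f_\theta(\varepsilon_t;\, c_{<t}), \qquad \varepsilon_t \sim \mathcal{N}(0, I),\\
\log p_\theta(z_{1:T}) &= \sum_{t=1}^{T} \Big[ \log \mathcal{N}\big(f_\theta^{-1}(z_t;\, c_{<t});\, 0, I\big)
  + \log \Big| \det \frac{\partial f_\theta^{-1}(z_t;\, c_{<t})}{\partial z_t} \Big| \Big].
\end{aligned}
```

Every term on the right is computable, so the same model can sample (draw $\varepsilon_t$ and push it through $f_\theta$) and score (invert a given $z_t$ and sum the terms). A deterministic hidden-state update defines no such density.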

Three concrete affordances follow. First, sampling: you can draw diverse reasoning paths instead of being stuck with one opaque vector. Second, scoring: you can rank competing trajectories by likelihood, which is the precondition for search and selection. Third, and most consequentially, policy-gradient refinement: once thoughts have a tractable likelihood, you can apply reinforcement learning directly to non-verbal reasoning, because policy gradients need the log-probability of the sampled trajectory and a deterministic latent offers none. Opaque vector reasoning gives you the compute scaling; the flow gives you back the ability to train and search over it.
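The three affordances can be sketched on a toy model. This is a minimal illustration under stated assumptions, not the NF-CoT architecture: `ToyThoughtFlow`, its fixed random weights, the single trainable `bias`, and the reward function are all hypothetical, and the flow is a simple affine autoregressive one in which each thought is a Gaussian whose mean and scale depend on the previous thought.

```python
import numpy as np


class ToyThoughtFlow:
    """Affine autoregressive flow over a sequence of continuous 'thoughts'.

    Hypothetical toy: z_t = mu(z_{t-1}) + sigma(z_{t-1}) * eps_t, eps_t ~ N(0, I).
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.dim = dim
        self.W_mu = rng.normal(scale=0.3, size=(dim, dim))
        self.W_ls = rng.normal(scale=0.1, size=(dim, dim))
        self.bias = np.zeros(dim)  # the only parameter updated in this sketch

    def _params(self, prev):
        mu = np.tanh(prev @ self.W_mu) + self.bias
        log_sigma = np.tanh(prev @ self.W_ls)  # bounded, so sigma stays positive and finite
        return mu, log_sigma

    def sample(self, steps, rng):
        """Affordance 1: draw one stochastic reasoning trajectory."""
        prev, thoughts = np.zeros(self.dim), []
        for _ in range(steps):
            mu, log_sigma = self._params(prev)
            prev = mu + np.exp(log_sigma) * rng.standard_normal(self.dim)
            thoughts.append(prev)
        return np.stack(thoughts)

    def log_prob(self, thoughts, return_bias_grad=False):
        """Affordance 2: exact log-likelihood via change of variables."""
        prev, logp, grad = np.zeros(self.dim), 0.0, np.zeros(self.dim)
        for z in thoughts:
            mu, log_sigma = self._params(prev)
            eps = (z - mu) * np.exp(-log_sigma)  # invert the affine map
            logp += -0.5 * np.sum(eps**2 + np.log(2 * np.pi)) - np.sum(log_sigma)
            grad += eps * np.exp(-log_sigma)  # d log p / d bias at this step
            prev = z
        return (logp, grad) if return_bias_grad else logp


def reinforce_bias_gradient(flow, reward_fn, n_samples, steps, rng):
    """Affordance 3: score-function (REINFORCE) gradient estimate w.r.t. bias."""
    trajs = [flow.sample(steps, rng) for _ in range(n_samples)]
    rewards = np.array([reward_fn(t) for t in trajs])
    advantages = rewards - rewards.mean()  # mean baseline to reduce variance
    grads = np.stack([flow.log_prob(t, return_bias_grad=True)[1] for t in trajs])
    return (advantages[:, None] * grads).mean(axis=0)


def toy_reward(traj):
    """Hypothetical reward: final thought should land near the all-ones vector."""
    return -np.sum((traj[-1] - 1.0) ** 2)


rng = np.random.default_rng(1)
flow = ToyThoughtFlow(dim=4)

# Sample several trajectories, then rank them by exact log-likelihood.
candidates = [flow.sample(steps=3, rng=rng) for _ in range(8)]
scores = [flow.log_prob(c) for c in candidates]
best = candidates[int(np.argmax(scores))]

# One policy-gradient ascent step on the toy reward.
flow.bias += 0.05 * reinforce_bias_gradient(flow, toy_reward, 64, 3, rng)
```

None of the three operations has an analogue when the thought is a single deterministic vector: there is nothing to draw from, no density to rank by, and no log-probability for the gradient estimator to differentiate.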

Why this matters becomes clearer next to what opaque representations hide. Models can post perfect task metrics while their internal organization is fractured and brittle — linear decodability masks broken structure that standard evaluation never sees (Can models be smart without organized internal structure?). An opaque latent reasoning step inherits exactly this problem: you can't probe whether it's coherent or just happens to decode correctly. A tractable likelihood is a thin form of accountability — it at least lets you measure and compare what the model is doing internally.

The corpus also hints that the discrete world wasn't as featureless as 'tokens vs. vectors' suggests, and that structure is what flows are reaching back toward. Reasoning verbosity turns out to be a single linear direction you can steer in activation space (Can we steer reasoning toward brevity without retraining?), and tokens within a chain carry rankable functional importance: symbolic-computation tokens matter more than grammar or filler (Which tokens in reasoning chains actually matter most?). Both rely on having something measurable and orderable to act on. The deeper bet of latent reasoning at the sentence or concept level (Can reasoning happen at the sentence level instead of tokens?) is that abstraction above tokens is worth the opacity cost. Normalizing flows suggest you may not have to pay the full cost: you can reason in continuous space and still keep the probabilistic instruments that made discrete reasoning trainable and searchable.
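To make "a single linear direction you can steer" concrete, here is a minimal sketch using a difference-of-means vector over paired activations. This is a common way to extract such a direction; the notes do not say that Activation-Steered Compression computes it exactly this way, and the function names and array shapes are assumptions.

```python
import numpy as np


def steering_vector(verbose_acts, concise_acts):
    """Mean difference (concise minus verbose) over paired activations.

    Both inputs are assumed to have shape (n_pairs, hidden_dim).
    """
    return (np.asarray(concise_acts) - np.asarray(verbose_acts)).mean(axis=0)


def steer(hidden, direction, strength):
    """Shift a hidden state by `strength` along the unit-normalized direction."""
    unit = direction / np.linalg.norm(direction)
    return hidden + strength * unit
```

The point of the sketch is that the intervention needs something measurable to act on: paired activations to subtract and a direction to project along.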

Sources (6 notes)

Can continuous thoughts have tractable likelihoods for sampling and scoring?

NF-CoT models continuous thoughts as an autoregressive normalizing flow inside the LLM's causal stream, recovering exact likelihood, probabilistic sampling, and KV-cache compatibility. This enables policy-gradient refinement and trajectory scoring on non-verbal reasoning, matching the tractability of textual CoT.

Can models reason without generating visible thinking tokens?

Multiple architectures—depth-recurrent models, Heima, and Coconut—demonstrate that test-time compute scales through hidden state iteration rather than token generation. This suggests verbalization is a training artifact, not a reasoning requirement.

Can models be smart without organized internal structure?

Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.

Can we steer reasoning toward brevity without retraining?

Activation-Steered Compression extracts a single vector from 50 paired examples to reduce chain-of-thought length by 67% while maintaining accuracy and achieving 2.73x speedup. The method is training-free and generalizes across model sizes and domains.

Which tokens in reasoning chains actually matter most?

Greedy likelihood-preserving pruning reveals six functional token categories; symbolic computation tokens are preferentially preserved while grammar and meta-discourse are pruned first. Student models trained on these pruned chains outperform those trained on frontier-model compression.

Can reasoning happen at the sentence level instead of tokens?

Meta's Large Concept Model operates on sentence embeddings rather than tokens, reasoning in a language-agnostic space before decoding to any target language. This hierarchical approach with paragraph-level planning produces more coherent output than flat token generation.
