What makes regularization an implicit factor in embedding geometry?
This explores how the regularization choices made during training—weight decay, latent penalties, structural constraints—quietly shape the geometry of learned embeddings, even though geometry feels like it should reflect 'the data' rather than a tuning knob.
The sharpest version of the claim is that a metric everyone treats as objective isn't. When you fit a regularized linear model with a closed-form solution, the cosine similarities between its learned embeddings turn out not to be unique: they depend on the regularization choice, not on any stable semantic structure, so the same data can yield different 'similarities' depending on a knob you set (Does cosine similarity actually measure embedding similarity?). The geometry you read off the model is partly a fingerprint of how you regularized it.
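A minimal sketch of that non-uniqueness, assuming a factorized model whose fit and penalty depend only on the product of the two embedding matrices. The matrices `A`, `B` and the scaling `d` are made-up illustration values, not taken from any cited note. Rescaling each latent dimension leaves every prediction unchanged, yet the cosine between the first two item embeddings moves from 0 to close to -1.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def predictions(A, B):
    """Return A @ B.T for row-major lists of lists."""
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]

def rescale(A, B, d):
    """Scale latent dimension k of A by d[k] and of B by 1 / d[k]."""
    A2 = [[a * s for a, s in zip(row, d)] for row in A]
    B2 = [[b / s for b, s in zip(row, d)] for row in B]
    return A2, B2

# Hypothetical 2-dimensional embeddings (illustration only).
A = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0]]   # one row per user
B = [[1.0, 1.0], [1.0, -1.0], [3.0, 0.5]]   # one row per item
d = [10.0, 0.1]                              # arbitrary positive scaling

A2, B2 = rescale(A, B, d)

# Largest absolute difference between the two sets of predictions.
max_pred_gap = max(
    abs(p - q)
    for row_p, row_q in zip(predictions(A, B), predictions(A2, B2))
    for p, q in zip(row_p, row_q)
)
cos_before = cosine(B[0], B[1])
cos_after = cosine(B2[0], B2[1])

print(f"largest prediction change: {max_pred_gap:.2e}")
print(f"cosine(item0, item1) before: {cos_before:+.4f}, after: {cos_after:+.4f}")
```

Any penalty that sees only the product cannot tell these two factorizations apart, so nothing in training pins down which cosine you end up reading off.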
Why does this happen? Because regularization is a thumb on the scale of which solutions are reachable. Embedding spaces are wildly underdetermined (many configurations fit the data equally well), and the penalty term picks the winner. You can watch this concretely in autoencoders: iterating their encode-decode map reveals attractor points and convergent trajectories that nobody designed, arising directly from weight decay, initialization, and data augmentation (Do autoencoders learn hidden attractors in latent space?). The contractive bias that pulls points toward attractors *is* the regularization, made visible as shape. Strengthen or weaken it and the basins move.
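A toy sketch of how basins move with contraction strength, under loud assumptions: the scalar map `x -> tanh(w * x)` is a hypothetical stand-in for a trained encode-decode map, and treating a smaller `w` as "heavier weight decay" is an illustrative simplification, not a result from the cited note.

```python
import math

def iterate_to_attractor(w, x0, steps=200):
    """Repeatedly apply the toy map x -> tanh(w * x), starting from x0."""
    x = x0
    for _ in range(steps):
        x = math.tanh(w * x)
    return x

starts = [-1.5, -0.1, 0.1, 1.5]

# Hypothetical "light decay" setting (w > 1): the origin is unstable and
# trajectories settle on one of two attractors, chosen by the sign of the start.
w_light_decay = 2.0
light = [iterate_to_attractor(w_light_decay, x0) for x0 in starts]

# Hypothetical "heavy decay" setting (w < 1): the map is a contraction,
# so every trajectory collapses onto the single attractor at 0.
w_heavy_decay = 0.5
heavy = [iterate_to_attractor(w_heavy_decay, x0) for x0 in starts]

for x0, a_light, a_heavy in zip(starts, light, heavy):
    print(f"start {x0:+.1f} -> light decay {a_light:+.4f}, heavy decay {a_heavy:+.4f}")
```

The same starting points end in different places purely because the contraction strength changed, which is the one-dimensional version of basins moving.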
The lever cuts both ways, which is why it's worth understanding rather than fearing. A single Gaussian-latent regularizer is enough to stop a JEPA from collapsing all its representations into a useless point, replacing six fiddly hyperparameters with one principled penalty that holds the geometry open (Can a single regularizer prevent JEPA representation collapse?). And structural constraints, a cousin of regularization, can be the whole story: ESLER's zero-diagonal rule (items can't predict themselves) forces prediction through inter-item relationships and beats most deep collaborative-filtering models, evidence that the imposed bias matters more than raw capacity (Can a linear model beat deep collaborative filtering?). Forcing sparsity onto weights likewise reshapes geometry into clean, modular, human-readable circuits that wouldn't form otherwise (Can sparse weight training make neural networks interpretable by design?).
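To make the zero-diagonal rule concrete, here is a hedged sketch of one known closed-form way to fit an item-item linear model under that constraint: the ridge-regression solution published for EASE-style shallow autoencoders. Whether ESLER is fitted exactly this way is an assumption, not something the source confirms. The function name `fit_zero_diagonal_linear`, the random click matrix `X`, and the penalty `lam` are all hypothetical.

```python
import numpy as np

def fit_zero_diagonal_linear(X, lam):
    """Minimize ||X - X B||_F^2 + lam * ||B||_F^2 subject to diag(B) = 0."""
    gram = X.T @ X
    P = np.linalg.inv(gram + lam * np.eye(gram.shape[0]))
    B = P / (-np.diag(P))       # divide column j by -P[j, j]
    np.fill_diagonal(B, 0.0)    # items may not predict themselves
    return B

rng = np.random.default_rng(0)
# Hypothetical binary user-item interactions: 50 users, 6 items.
X = (rng.random((50, 6)) < 0.3).astype(float)
lam = 5.0

B = fit_zero_diagonal_linear(X, lam)
scores = X @ B                  # each item is scored from the *other* items only

print(np.round(B, 3))
```

Without the constraint, the identity matrix would reproduce `X` perfectly and teach the model nothing. Zeroing the diagonal removes that shortcut, so all of the fit has to flow through item-to-item weights, including negative ones.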
The unsettling corollary is that the same regularizing pressures decide whether a model memorizes or generalizes (the attractor work frames its emergent geometry as sitting exactly on that spectrum; Do autoencoders learn hidden attractors in latent space?), and their effects can stay hidden as easily as they help. Models trained with ordinary SGD can carry every linearly-decodable feature a task needs while their internal organization is quietly fractured, a brittleness invisible to accuracy metrics but exposed by perturbation (Can models be smart without organized internal structure?). Even apparently 'intrinsic' properties like activation density turn out to be trained in, not given: networks learn dense codes for familiar data and fall back to sparse ones for the unfamiliar (Is representational sparsity learned or intrinsic to neural networks?).
What you didn't know you wanted to know: the structure people celebrate as 'emergent' (polar coordinates encoding syntax, eigenvectors that recover the WordNet hierarchy coarse-to-fine) lives in a space whose ruler was set by regularization (How do language models encode syntactic relations geometrically?; Do embedding eigenvectors organize taxonomy from coarse to fine?). The geometry is real, but the coordinate frame you measure it in is a choice, which is why a similarity score can be both meaningful and unstable at the same time.
Sources (9 notes)
Regularized linear models with closed-form solutions show that cosine similarities between embeddings are not unique and depend on regularization choices made during training, not on actual semantic structure. This makes cosine scores unstable and potentially meaningless.
Iterating an autoencoder's encode-decode map reveals convergent trajectories with attractor points that emerge from training-induced contractive biases. These attractors arise naturally from initialization schemes, weight decay, and data augmentation—without explicit design—and their nature reflects the memorization-versus-generalization spectrum of the training regime.
LeWorldModel trains a JEPA end-to-end using only next-embedding prediction and a Gaussian-latent regularizer, reducing tunable hyperparameters from six to one. The model achieves competitive control performance and 48× faster planning than foundation-model world models on a single GPU.
ESLER, a single-layer linear autoencoder constrained so items cannot predict themselves, outperforms most deep CF models. The constraint forces prediction through item relationships, and negative weights encoding anti-affinity prove essential—structural bias matters more than model capacity.
Training transformers with sparse weights creates compact, human-interpretable circuits where neurons correspond to simple concepts with clear connections. Ablation studies confirm these circuits are necessary and sufficient for task performance, though scaling beyond tens of millions of parameters while maintaining interpretability remains unsolved.
Models trained with SGD can contain all the linearly decodable features needed for a task while maintaining fundamentally broken internal organization. This makes them vulnerable to perturbation and distribution shift invisible to standard evaluation metrics.
During pretraining, neural networks develop dense activations for familiar training data and default to sparse representations for unfamiliar inputs. This trend emerges without task-specific fine-tuning and reflects how models consolidate knowledge through exposure.
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
Leading eigenvectors of embedding Gram matrices separate broad taxonomic branches first, then progressively finer sub-branches—a coarse-to-fine spectral order that tracks the WordNet hypernym tree level by level, confirming predictions from co-occurrence statistics.