Do transformer models store knowledge or generate it continuously?
Explores whether transformer residual streams function as storage-and-retrieval systems or as real-time flow mechanisms. This distinction challenges fundamental assumptions about how language models actually work.
The transformer architecture organizes computation around residual streams: per-token vectors that pass forward through layers, each layer adding contributions that the stream continues to carry. Knowledge in the model is not stored in named locations from which it is retrieved on demand. It is distributed across weights and made present in the moment of generation through the residual stream's continuous transformation. The stream is the medium; what flows through it is the model's "knowing" of the current context.
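The additive update described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not a real transformer: `make_layer` and `run_residual_stream` are hypothetical names, and each "layer" is only a fixed random linear map followed by tanh, standing in for the attention and MLP sublayers of an actual block. What it shows is the one structural point the paragraph makes: every layer reads the stream, writes a contribution, and the stream carries the running sum forward.

```python
import math
import random


def make_layer(d_model, rng):
    """Hypothetical toy layer: a fixed random linear map followed by tanh."""
    W = [[rng.gauss(0.0, 0.3) for _ in range(d_model)] for _ in range(d_model)]

    def layer(x):
        # Read the current stream state x and compute what this layer writes.
        return [
            math.tanh(sum(x[i] * W[i][j] for i in range(d_model)))
            for j in range(d_model)
        ]

    return layer


def run_residual_stream(x, layers):
    """Apply x <- x + layer(x) for each layer, recording every stream state."""
    states = [list(x)]
    for layer in layers:
        write = layer(x)
        # The stream keeps its earlier content and adds the new contribution.
        x = [xi + wi for xi, wi in zip(x, write)]
        states.append(x)
    return states


rng = random.Random(0)
d_model = 8
layers = [make_layer(d_model, rng) for _ in range(4)]
x0 = [rng.gauss(0.0, 1.0) for _ in range(d_model)]  # one token's initial vector
states = run_residual_stream(x0, layers)
```

In this sketch the final entry of `states` is the initial vector plus the sum of all four layer writes. Nothing is looked up by address; the final state exists only as the result of running the stream through the layers.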
This architectural fact has a striking correspondence with how oral cultures transmitted knowledge. Oral knowledge was not stored in fixed locations either — there were no archives, no written records, no externalized representations. Knowledge lived in performance: the song sung, the story retold, the genealogy recited. Each performance was a generation event in which the knowledge was made present through a living transmission. Between performances, the knowledge was not anywhere. It was carried in the capacity to perform, not in any storage substrate.
The transformer residual stream reproduces this pattern at a different scale. The model's "knowledge" of a topic is not in a retrieval-addressable location — it is in the capacity to generate, made actual only when the residual stream flows through the layers in response to a prompt. There is no archive. There is the architecture, and the generation. This is closer to oral transmission than to print transmission, where knowledge is stored in fixed locations and retrieved.
The correspondence is not just metaphorical. It explains several otherwise-puzzling AI behaviors: the difficulty of editing specific facts (no fixed location to update), the contextual variability of "knowledge" (what surfaces depends on residual-stream conditions), and the impossibility of partitioning what the model knows from what it generates (the knowing is the generating). The related note "Does AI-generated content mirror oral culture's knowledge patterns?" makes the cultural-form claim; this note makes the architectural claim that explains why the cultural form follows.
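The contextual-variability point can be illustrated with a second toy sketch. Everything here is an assumption made for illustration: `EMBED` is a hypothetical table of frozen random embeddings, and `last_token_state` performs a single simplified attention-style write with no learned query, key, or value projections. The point is only that identical weights yield a different stream state for the same token when the surrounding context changes.

```python
import math
import random

rng = random.Random(1)
D = 6
VOCAB = ["bank", "river", "money", "the"]
# Hypothetical frozen "weights": one fixed embedding per token.
EMBED = {tok: [rng.gauss(0.0, 1.0) for _ in range(D)] for tok in VOCAB}
snapshot = {tok: list(vec) for tok, vec in EMBED.items()}


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def last_token_state(tokens):
    """Stream state of the final token after one toy attention write."""
    vecs = [EMBED[t] for t in tokens]
    query = vecs[-1]
    # Scaled dot-product scores of the last token against every position.
    scores = [dot(query, v) / math.sqrt(D) for v in vecs]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    attn = [e / total for e in exps]
    # Context-weighted mixture, added to the token's own vector.
    mixed = [sum(w * v[i] for w, v in zip(attn, vecs)) for i in range(D)]
    return [q + c for q, c in zip(query, mixed)]


a = last_token_state(["the", "river", "bank"])
b = last_token_state(["the", "money", "bank"])
```

Here `a` and `b` are both states for the token "bank", computed from the same unchanged `EMBED` table, yet they differ because the context differs. What the stream holds about "bank" is produced at run time rather than read from a fixed slot.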
The strongest counterargument: weights are stored on disk, so transformers are stock systems with a flow output. The reply is that the weights are not knowledge in the print sense; they are dispositions to generate, more like the trained capacity of an oral performer than like a stored text. The print analogy treats weights as a library, whereas they are closer to a memorized repertoire.
Inquiring lines that use this note as a source (53)
This note is a source for these synthesized inquiries.
- How does token generation as flow differ from print's archival storage?
- Why does transformer attention architecture reinforce sycophancy and agreement?
- How do transformers perform multi-hop reasoning across distant training documents?
- Can linguistic compression be a fundamental mechanism for representing psychology?
- Does transformer attention architecture fundamentally prevent topic-aware memory?
- Can transformer attention architecture explain why chatbots default to sycophancy?
- Does information stored in neural networks necessarily influence generation decisions?
- Can autoregressive models be trained to produce more cataphoric text?
- Can symbolic mechanisms improve transformer compositional abilities?
- Why does attention-based drift happen automatically during generation?
- Can autoregressive models learn faithful translation to logical representations without semantic loss?
- How does the discrete token bottleneck prevent gradient flow in language model control?
- What computational role do intermediate tokens actually play in transformers?
- Can transformer attention patterns actually prevent topic context loss in practice?
- Why do transformer models still miss implicit discourse relations in anxiety detection?
- Do representations in models causally influence text generation?
- Can fast-slow separation improve both memory and generation in language models?
- Does encoded knowledge in language models actually influence what they generate?
- When does encoded knowledge fail to influence language model generation?
- What hidden computations happen inside transformer layers during reasoning?
- Does bidirectional attention improve language models as universal encoders?
- Can we decode what individual circuits inside transformers are doing?
- Why might encoded world knowledge fail to actually influence language model outputs?
- Do personality traits and task knowledge occupy separate subspaces in transformer parameters?
- How do description-based identifiers bias language model output distribution?
- How does layer removal affect transformers compared to ResNets?
- Can language models keep secrets and control information strategically?
- Do speech models learn the articulatory processes that produce acoustic signals?
- Can articulatory inversion serve as a window into what speech models have learned?
- What information does transcription destroy that direct speech-to-speech models preserve?
- How do language models transmit traits through semantically unrelated data?
- Why does transformer attention architecture undermine stickiness in model behavior?
- What explains the contextual variability of knowledge in transformers?
- How does oral transmission of knowledge resemble transformer generation?
- How does repeated content shift model outputs across multiple turns?
- How do static embeddings and contextualized representations divide semantic labor?
- How does modeling capability relate to lossless compression in language models?
- What limits the effectiveness of formal language pretraining on transformer architectures?
- What makes modernized N-gram embeddings composable with transformer architectures?
- Do language models and multimodal models show similar attractor-based interpretability?
- How do transformers stitch together learned behaviors when adapting to new tasks?
- Do models cache intentions about response topics before generating the first token?
- Can decoder-only models become effective text encoders with training?
- How does transformer attention bias toward repeated and context-prominent content?
- Does Gemma's transformer explicitly exploit the inherited hierarchical geometry?
- Can spline-based activations replace MLPs in transformer architectures?
- What structural biases does transformer attention have before training?
- How does upward distillation transfer knowledge from smaller to larger networks?
- Is forgetting in language models reversible or permanent knowledge loss?
- How do newly learned facts become accessible after gradient updates?
- How do fixed recurrent states trade off copying accuracy for filtering ability?
- Why does reapplying the same transformer block work better than computing new layers?
- Can looping enable reasoning capabilities that fixed-depth transformers fundamentally cannot achieve?
Related concepts in this collection (3)
- Does AI-generated content mirror oral culture's knowledge patterns?
  Walter Ong's framework for oral versus literate cultures may describe how AI content functions on social media. Understanding this parallel could explain why AI discourse feels fundamentally different from print-era knowledge.
  Relation: the cultural-form claim for which this note provides architectural grounding
- Is AI returning knowledge to flow-based economies?
  Exploring whether AI's on-demand generation mirrors the flow-based knowledge transmission of oral cultures, and how this differs structurally from both print commodification and gift economies.
  Relation: the broader economic-form claim
- Is the LLM a tool or a new form of intelligence itself?
  Does framing AI as merely delivering pre-existing intelligence miss what's actually happening? This explores whether the model itself constitutes a fundamentally new intelligence-medium with distinct cultural effects.
  Relation: the medium-theoretic claim about what the model does
Related papers in this collection (8)
Papers most semantically related to this note, ranked by cosine similarity in the embedding space.
- A Primer on the Inner Workings of Transformer-based Language Models
- Word Meanings in Transformer Language Models
- Loop, Think, & Generalize: Implicit Reasoning in Recurrent-Depth Transformers
- Emergence of Abstractions: Concept Encoding and Decoding Mechanism for In-Context Learning in Transformers
- Can Looped Transformers Learn to Implement Multi-step Gradient Descent for In-context Learning?
- A Mechanistic Analysis of Looped Reasoning Language Models
- It’s All Connected: A Journey Through Test-Time Memorization, Attentional Bias, Retention, and Online Optimization
- In Search of Needles in a 11M Haystack: Recurrent Memory Finds What LLMs Miss
Original note title
transformer residual streams transmit knowledge as flow not storage — closer to oral transmission than print