What other semantic relations benefit from explicit surface markers in text?
This explores which semantic relations get clearer when they carry explicit surface markers, instead of being left for a model or reader to infer. The corpus suggests the pattern is broad: almost any relation that's normally implicit becomes more tractable once it's made visible on the surface — and conversely, relations that stay implicit are exactly where systems break.
Start with syntax. Work on the geometry of activations shows that grammatical relations like subject-of or modifier-of are encoded not just by distance but by *direction*, a kind of internal surface marker for relation type (see "How do language models encode syntactic relations geometrically?"). Discourse cohesion is the next layer up: pointing relations between sentences benefit enormously from explicit markers, and human writers and ChatGPT diverge precisely here. Humans lean on cataphoric (forward-pointing) cues that preview an argument, while ChatGPT defaults to anaphoric (backward-pointing) summary (see "Does ChatGPT organize text differently than human writers?"). The relation is the same (this clause depends on that one), but the direction in which it is signaled changes how a reader builds the structure.
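To make the distance-versus-direction point concrete, here is a toy sketch. It is not the probing method from the cited work: the 2-D coordinates, the 45-degree threshold, and the `relation_from_angle` helper are all assumptions invented for illustration. It only shows why distance alone can leave two relations indistinguishable while direction separates them.

```python
import math

# Hypothetical 2-D "embeddings" (illustrative values, not real activations).
head = (0.0, 0.0)
subject = (1.0, 0.0)    # toy position for a subject-of dependent
modifier = (0.0, 1.0)   # toy position for a modifier-of dependent


def distance(a, b):
    """Euclidean distance between two 2-D points."""
    return math.hypot(b[0] - a[0], b[1] - a[1])


def angle_degrees(a, b):
    """Direction of the vector from a to b, in degrees."""
    return math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))


def relation_from_angle(angle):
    """Assumed mapping from direction to relation label (illustration only)."""
    return "subject-of" if abs(angle) < 45 else "modifier-of"


d_subj = distance(head, subject)
d_mod = distance(head, modifier)
a_subj = angle_degrees(head, subject)
a_mod = angle_degrees(head, modifier)

# Both dependents sit at the same distance from the head,
# so a distance-only readout cannot tell the relations apart.
print(d_subj, d_mod)
# The direction differs, so an angle-based readout can.
print(relation_from_angle(a_subj), relation_from_angle(a_mod))
```

In this toy setup the two dependents are equidistant from the head, so only the angle carries the relation type.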
The most striking case is relational structure: joins, hierarchies, cross-references. Long-context models can absorb a whole corpus and still fail the moment a query requires *relating* records across structured tables; raw text length doesn't supply the join markers that a database schema makes explicit (see "Can long-context LLMs replace retrieval-augmented generation systems?"). The fix in retrieval work is to add the markers back: hierarchical knowledge graphs encode part-whole and cross-chapter links explicitly, answering global questions that flat chunk retrieval can't reach (see "Can multimodal knowledge graphs answer questions that flat retrieval cannot?"). The same logic shows up in agents: accessibility trees give a GUI controller an explicit map of element relations that raw screenshots leave buried, separating grounding from planning (see "Can structured interfaces help language models control GUIs better?").
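As a small illustration of what "join markers that a database schema makes explicit" means, the sketch below uses Python's standard `sqlite3` module. The `authors` and `papers` tables, their columns, and the rows are hypothetical and not drawn from any benchmark mentioned here. The point is only that the `author_id` key states the relation between records outright, so the query never has to infer it from prose.

```python
import sqlite3

# In-memory database with two hypothetical tables linked by an explicit key.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE authors (
        author_id INTEGER PRIMARY KEY,
        name      TEXT
    );
    CREATE TABLE papers (
        paper_id  INTEGER PRIMARY KEY,
        title     TEXT,
        author_id INTEGER REFERENCES authors(author_id)
    );
    INSERT INTO authors VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO papers VALUES
        (10, 'On Joins', 1),
        (11, 'On Trees', 2),
        (12, 'On Graphs', 1);
    """
)

# The join condition is the explicit surface marker: it names which
# record in one table relates to which record in the other.
rows = conn.execute(
    """
    SELECT a.name, COUNT(*) AS paper_count
    FROM authors AS a
    JOIN papers AS p ON p.author_id = a.author_id
    GROUP BY a.name
    ORDER BY a.name
    """
).fetchall()

print(rows)
conn.close()
```

If the same facts were flattened into running text, the link between each paper and its author would have to be recovered by inference, which is the step the paragraph above describes as failing.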
Why does explicitness matter so much? Because the default mechanism is surface statistics, not relational reasoning. Models systematically prefer high-frequency phrasings over semantically equivalent rare ones, tracking statistical mass rather than meaning (see "Do language models really understand meaning or just surface frequency?"). Since frequent words skew abstract, that bias quietly erases the specific relations expert language encodes (see "Does word frequency correlate with semantic abstraction?"). When a relation has no surface marker at all (say, a sentence that's deliberately ambiguous in scope or structure), performance collapses: GPT-4 correctly disambiguates only 32% of such cases versus 90% for humans (see "Can language models recognize when text is deliberately ambiguous?").
The thing worth taking away: the relations that *need* explicit markers most are the relations these systems are worst at inferring — relational joins, scope, discourse direction, abstraction level. Surface markers aren't decoration; they're a substitute for reasoning the model isn't reliably doing, which is why so much applied work (graphs, accessibility trees, structured interfaces) is really just the project of re-surfacing relations that plain text leaves implicit.
Sources (8 notes)
The Polar Probe shows LLMs represent syntactic type and direction through both distance and angular position between embeddings, nearly doubling accuracy over distance-only methods. This demonstrates neural networks spontaneously learn structured, symbolic-compatible geometry.
ChatGPT defaults to summarizing what was already said, while students use more forward-pointing structure that previews upcoming arguments. This reflects different reader models and may stem from how autoregressive generation works token by token.
The LOFT benchmark shows LCLMs match RAG on semantic retrieval without explicit training, but cannot execute relational queries requiring joins across structured tables. Context length alone cannot bridge this gap.
MegaRAG builds hierarchical multimodal knowledge graphs from text and visuals to answer cross-chapter, global questions that flat chunk retrieval cannot reach. The hierarchy supports abstraction levels from high-level summaries to page-specific details while treating images as first-class graph nodes.
Agent S's dual-input design—visual input for environmental understanding plus image-augmented accessibility trees for grounding—achieved 9.37% improvement over baseline by factoring planning and grounding into separate optimization paths rather than forcing end-to-end prediction.
LLMs show consistent preference for higher-frequency surface forms over semantically equivalent rare paraphrases across math, machine translation, commonsense reasoning, and tool calling. This suggests models track statistical mass from pretraining rather than meaning-recognition as their primary mechanism.
WordNet analysis shows hypernyms (general concepts) occur more frequently than hyponyms (specific ones). Combined with LLMs' frequency bias, this means preferring common paraphrases systematically drifts toward abstraction, erasing expert-level specificity.
AMBIENT benchmark shows GPT-4 correctly disambiguates only 32% of cases versus 90% for humans. This failure spans lexical, structural, and scope ambiguity—revealing that LLMs cannot hold multiple interpretations simultaneously, a fundamental gap hidden by standard benchmarks.