SYNTHESIS NOTE

Can LLMs learn reliably at test time without human oversight?

How can language models adapt to rapidly changing rules and knowledge during inference rather than waiting for retraining? What prevents fully autonomous systems from handling conflicting information?

Synthesis note · 2026-05-18 · sourced from Reinforcement Learning

Many real-world domains have rules that change faster than training cycles can track: regulatory compliance, user risk screening, evolving customer policies. The standard remedies each fall short. Offline fine-tuning lags reality. In-context learning (ICL) supplies examples but cannot integrate novel rules. Retrieval-augmented generation (RAG) retrieves documents but cannot reconcile contradictions among them. Test-time fine-tuning adjusts parameters but cannot easily handle conflict between old and new knowledge.

ARIA (2507.17131) proposes that test-time learning requires three things existing methods do not combine: (1) structured uncertainty self-assessment, (2) a timestamped knowledge base with conflict detection, and (3) human-in-the-loop (HITL) clarification queries when contradictions surface. Each component does specific work that the others cannot.

Structured self-dialogue is the uncertainty assessor. Rather than thresholding a model confidence score, which tends to be poorly calibrated, ARIA generates a reflective Q&A about its own preliminary judgment: questioning implicit assumptions, recalling related prior experiences, identifying domain-knowledge gaps. This turns confidence assessment into an inspectable reasoning trace that surfaces what kind of uncertainty exists (factual gap, procedural ambiguity, conflict with a prior rule). The structure is what makes the assessment more reliable than a scalar confidence score.
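The self-dialogue step can be sketched as a fixed set of reflective questions whose answers are tagged with the kind of uncertainty they surface. This is a minimal illustration, not ARIA's implementation: the names `UncertaintyKind`, `ReflectionStep`, `self_dialogue`, and `needs_clarification`, the prompt wording, and the `ask` callable (a stand-in for a model call that returns an answer plus an optional uncertainty tag) are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Tuple


class UncertaintyKind(Enum):
    FACTUAL_GAP = "factual gap"
    PROCEDURAL_AMBIGUITY = "procedural ambiguity"
    RULE_CONFLICT = "conflict with prior rule"


@dataclass
class ReflectionStep:
    question: str                    # what the model asks about its own judgment
    answer: str                      # the model's answer to that question
    kind: Optional[UncertaintyKind]  # uncertainty surfaced by this step, if any


# Hypothetical prompts mirroring the three reflective moves described above.
REFLECTION_PROMPTS = [
    "What implicit assumptions does this judgment rely on?",
    "Which prior cases resemble this one, and how were they resolved?",
    "What domain knowledge is missing to be sure?",
]

# `ask` is an assumed interface: prompt in, (answer, uncertainty tag or None) out.
AskFn = Callable[[str], Tuple[str, Optional[UncertaintyKind]]]


def self_dialogue(judgment: str, ask: AskFn) -> List[ReflectionStep]:
    """Run every reflective question and keep the full trace for inspection."""
    steps = []
    for question in REFLECTION_PROMPTS:
        answer, kind = ask(f"Preliminary judgment: {judgment}\n{question}")
        steps.append(ReflectionStep(question, answer, kind))
    return steps


def needs_clarification(steps: List[ReflectionStep]) -> bool:
    """Escalate when any step surfaced a typed uncertainty."""
    return any(step.kind is not None for step in steps)
```

The trace, rather than a single number, is what gets inspected: each step records which question raised the doubt and what kind of doubt it was.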

The timestamped knowledge repository is the conflict detector. Each acquired knowledge item is stored with its acquisition timestamp. When new knowledge arrives, ARIA retrieves related entries by semantic matching and compares them against the new information. Inconsistencies are flagged. Older entries are marked as potentially obsolete rather than deleted, which preserves history while signaling which entry is current.
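A minimal sketch of such a repository follows, under loud assumptions: word-set Jaccard overlap stands in for semantic matching, the 0.3 threshold is arbitrary, and contradiction detection is delegated to a caller-supplied predicate (here a toy negation check). The names `KnowledgeItem`, `KnowledgeStore`, `token_overlap`, and `toy_contradicts` are hypothetical and do not come from ARIA.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, List


@dataclass
class KnowledgeItem:
    text: str
    acquired: date                   # acquisition timestamp
    possibly_obsolete: bool = False  # flagged, never deleted


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a crude stand-in for semantic matching."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def toy_contradicts(a: str, b: str) -> bool:
    """Toy predicate: exactly one of the two statements contains the word 'not'."""
    return ("not" in a.lower().split()) != ("not" in b.lower().split())


class KnowledgeStore:
    def __init__(self, contradicts: Callable[[str, str], bool], threshold: float = 0.3):
        self.items: List[KnowledgeItem] = []
        self.contradicts = contradicts
        self.threshold = threshold  # assumed relatedness cutoff

    def add(self, text: str, acquired: date) -> List[KnowledgeItem]:
        """Store a new item and return the related stored items it conflicts with."""
        conflicts = [
            old for old in self.items
            if token_overlap(old.text, text) >= self.threshold
            and self.contradicts(old.text, text)
        ]
        for old in conflicts:
            if old.acquired < acquired:
                old.possibly_obsolete = True  # keep history, signal currency
        self.items.append(KnowledgeItem(text, acquired))
        return conflicts
```

Adding "transfers above 10000 require manual review" dated 2025-03-01 and then "transfers above 10000 do not require manual review" dated 2025-09-01 would return the first item as a conflict and flag it, while both entries stay in the store.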

Active clarification queries are the conflict resolver. When contradictions surface, ARIA does not silently choose one version or fail. It generates targeted queries back to human experts: "rule X dated 2025-03 said Y; new guidance dated 2025-09 implies not-Y; please clarify." The human-mediated resolution is the load-bearing step — the system does not attempt to autonomously adjudicate between conflicting rules.
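The shape of such a query can be sketched as a small formatting function over two conflicting rules. This is an illustration only: the `Rule` record, the `clarification_query` function, and the exact wording of the question are assumptions, and the source does not specify how ARIA phrases its queries beyond the example above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Rule:
    name: str        # hypothetical identifier for the rule
    statement: str   # what the rule says
    acquired: date   # when the system learned it


def clarification_query(a: Rule, b: Rule) -> str:
    """Build a targeted question for a human expert; the system does not pick a winner."""
    # Order by acquisition date so the query always reads older-then-newer.
    older, newer = sorted((a, b), key=lambda r: r.acquired)
    return (
        f"Rule '{older.name}' dated {older.acquired:%Y-%m} said: {older.statement}. "
        f"New guidance '{newer.name}' dated {newer.acquired:%Y-%m} implies: {newer.statement}. "
        "Please clarify which applies now, and whether the older rule remains valid in any case."
    )
```

The function only describes the conflict and asks; deciding which rule holds is left to the expert's reply, in line with the human-mediated resolution described above.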

The deeper claim is about what kind of learning AGI needs. Strong-AI fantasies of fully autonomous adaptation collapse on the conflict-resolution problem. When old and new knowledge disagree, no purely autonomous system can reliably pick the right resolution because the choice depends on context outside the system (regulatory authority, organizational priority, expert judgment). ARIA accepts this constraint and designs the human-mediated loop as a first-class component rather than a fallback.

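For deployment, this combination (structured self-assessment, timestamped storage with conflict flags, and human clarification of contradictions) is the architectural pattern for any system operating in a rapidly changing domain where the cost of acting on outdated rules is high.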

Inquiring lines that read this note 20

This note is a source for these research framings, grouped by the broader line of inquiry each explores.

How do evaluation biases undermine LLM quality assessment systems?

Can LLMs evaluate their own observations without external feedback?

How can LLM user simulators model realistic goal-driven conversation?

Can parallel agents or complementary mechanisms replace single-human interrogation of LLMs?

How should we design LLM systems to maintain alignment and control?

What deployment feedback loops amplify LLM pretraining popularity in live systems?

Does self-reflection enable models to reliably correct their errors?

What makes self-modifying architectures learn their own update rules?

What articulatory information do speech signals carry that text cannot?

Can multimodal LLMs be made to spontaneously adapt their language for efficiency?

How do knowledge injection methods compare across cost and effectiveness?

How do training-time and inference-time knowledge injection techniques compare?

How faithfully do LLMs reflect their actual reasoning in outputs and explanations?

Why do language models reinforce false assumptions instead of correcting them?

Can fact-checking systems use LLMs reliably if models abandon correct positions under pressure?

Do accurate-looking LLM outputs hide structural failures in learning and reasoning?

Where do LLMs fail as knowledge systems compared to humans?

How can models identify insufficient information and respond appropriately without guessing?

What alternatives exist when required knowledge is absent from training?

How do language models inherit human biases from training data?

Can LLMs coordinate with humans better using different model architectures?

What capability tradeoffs emerge when scaling model reasoning abilities?

What test-time strategies did o3 discover without human specification?

How should inference compute be adaptively allocated based on prompt difficulty?

What deployment tradeoffs emerge between single-pass and multi-pass inference adaptation?

What pretraining choices and baseline capability constrain reinforcement learning gains?

How does LLM simulation of APIs avoid instability without sacrificing training signal?

Why do continual learning scenarios trigger catastrophic forgetting and interference?

Can AI models retain knowledge across changing environments without catastrophic forgetting?

Why does finetuning cause catastrophic forgetting of model capabilities?

Where does skill extraction fail compared to genuine model adaptation?

How do training priors constrain what context information can override?

What makes some contexts learnable as rules versus requiring model retraining?

How do language models establish social grounding in human dialogue?

Why do LLMs lack the communicative scaffold that humans learn?

Can AI-generated outputs constitute genuine knowledge or valid claims?

Can autonomous systems ever resolve contradictions between old and new rules?

Related concepts in this collection 3

How should agents decide what memories to keep?
Agent memory management splits between agents autonomously recognizing important information versus programmatic triggers. Understanding this choice reveals why different memory architectures prioritize different information types.
ARIA implements the hot-path with structured self-dialogue as the importance-recognition mechanism.

Does agent memory degrade when continuously consolidated?
Can consolidating agent experiences into summaries actually harm long-term performance? Research on ARC-AGI tasks suggests continuous memory updates may reduce capability below the no-memory baseline.
ARIA's timestamped retention avoids the consolidation regression by NOT consolidating across rules; it keeps each rule episodic with conflict markers.

Can three axes replace the short-term long-term memory split?
Does breaking agent memory into forms, functions, and dynamics provide a clearer framework than the traditional short-term/long-term distinction? This matters because current agent-memory literature lacks a unified vocabulary, making comparison between systems nearly impossible.
ARIA's knowledge base is token-form, working-function, with formation-via-HITL and evolution-via-conflict-resolution, a specific point in the design space.

Original note title

test-time learning requires structured self-dialogue plus timestamped knowledge base with conflict-resolution queries
