INQUIRING LINE

Can hard numbers in a prompt actually change what an AI does, or does its training always win?

Can explicit numerical signals override learned linguistic defaults in fine-tuned models?

This explores whether feeding a model hard numbers — explicit constraints, probabilities, verifiable signals — can actually steer its behavior, or whether the model just falls back on the linguistic and semantic habits it absorbed during training.

The corpus leans toward a sobering answer: by default, linguistic priors usually win unless the numerical signal is wired in at the level of training or representation rather than merely stated in the prompt. The most direct evidence is that prompting alone rarely beats a strong prior. When a model has absorbed a strong association during training, in-context information that contradicts it tends to be ignored; textual instruction alone cannot override it, and only causal intervention in the model's internal representations reliably shifts the output (see "Why do language models ignore information in their context?"). A number placed in the prompt is just another piece of context, and context is exactly what loses this fight.

The reason runs deeper than stubbornness: it concerns what these models are actually doing when they appear to process a number. When the semantic content is stripped from a reasoning task and only the formal structure remains (the kind of thing a numerical or symbolic signal carries), performance collapses, even with the correct rules sitting right there in context. Models lean on commonsense token associations rather than manipulating symbols, which keeps their reasoning confined to the semantics of their training distribution (see "Do large language models reason symbolically or semantically?"). An explicit number is a symbolic signal asking for symbolic handling, and that is precisely the mode in which these models are weakest.
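The decoupling described above can be pictured with a small sketch. This is not the procedure from the cited note; it only illustrates one way to strip semantic content from a rule while keeping its formal structure. The function name `decouple_semantics`, the symbol scheme (`s1`, `s2`, ...), and the example rule are all assumptions made for illustration.

```python
import re

def decouple_semantics(text: str, terms: list[str]) -> tuple[str, dict[str, str]]:
    """Replace each meaningful term with an arbitrary symbol (s1, s2, ...).

    Hypothetical helper: the formal structure of the rule is preserved,
    but the commonsense associations attached to the words are removed.
    """
    mapping = {term: f"s{i}" for i, term in enumerate(terms, start=1)}
    # Match longer terms first so a multi-word term wins over its parts.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in ordered) + r")\b")
    return pattern.sub(lambda m: mapping[m.group(1)], text), mapping

# Made-up example rule, not taken from any benchmark.
rule = "If an animal is a bird, then the animal can fly."
symbolic_rule, mapping = decouple_semantics(rule, ["animal", "bird", "can fly"])
print(symbolic_rule)  # If an s1 is a s2, then the s1 s3.
```

A model that genuinely manipulates symbols should handle both versions of the rule equally well; the finding summarized above is that performance drops on the symbolic version.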

There is a subtler version of the failure worth knowing about: a model can look like it is responding to an explicit constraint when it is really just defaulting. When numerical constraints were removed from problems, twelve of fourteen models actually got *worse*, dropping by up to 38.5 percentage points; they had been exploiting a conservative bias (defaulting to the harder option) rather than genuinely evaluating the constraint (see "Are models actually reasoning about constraints or just defaulting conservatively?"). So even apparent success at honoring explicit signals can be a learned default in disguise. RLHF compounds the problem along a different dimension: it can push a model toward expressing whatever reads well rather than what its internal state represents as true, so a stated signal competes against a learned tendency to perform (see "Does RLHF make language models indifferent to truth?").
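The constraint-removal check can be sketched as a simple ablation: score the same model on problems with and without the constraint and report the gap in percentage points. This is a minimal sketch under stated assumptions; the function names, the "hard"/"easy" labels, and the toy predictions are hypothetical and are not data from the cited note.

```python
def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)

def ablation_gap_points(preds_with, gold_with, preds_without, gold_without) -> float:
    """Accuracy drop, in percentage points, when the constraint is removed."""
    return 100 * (accuracy(preds_with, gold_with) - accuracy(preds_without, gold_without))

# Toy data (made up): with the constraint present, "hard" is the right answer.
gold_with = ["hard", "hard", "hard", "hard"]
preds_with = ["hard", "hard", "hard", "hard"]

# Without the constraint, "easy" becomes correct, but the model keeps
# answering "hard" three times out of four.
gold_without = ["easy", "easy", "easy", "easy"]
preds_without = ["hard", "hard", "hard", "easy"]

gap = ablation_gap_points(preds_with, gold_with, preds_without, gold_without)
print(f"accuracy drop: {gap:.1f} points")  # accuracy drop: 75.0 points
```

A large positive gap suggests the model was never reading the constraint: its answers stayed the same while the correct answer changed.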

But the corpus also points to where numerical signals *do* break through, and it is not in the prompt but in the training loop. A single verifiable example in RLVR can lift math accuracy from 36% to 73.6%, with test accuracy continuing to improve for about 1,400 steps after training accuracy reaches 100% (see "Can a single training example unlock mathematical reasoning?"). An explicit, checkable signal there does not fight the linguistic default; it activates latent capability the model already had. Similarly, uncertainty-aware training lets small models learn to abstain and match models ten times larger on conversation forecasting, showing that calibration ability exists in these models but is undertrained by default (see "Can models learn to abstain when uncertain about predictions?"). Binary environmental feedback, a pure success/failure signal, lets agents self-correct without touching their weights, precisely because the unambiguous signal blocks the rationalization that linguistic flexibility allows (see "Can agents learn from failure without updating their weights?").
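What makes a signal "verifiable" can be shown with a binary reward function of the kind an RLVR-style setup might use. This is a sketch, not the reward used in the cited work. It assumes the final answer is the last number in the completion and compares it exactly to a reference; the names `extract_final_answer` and `verifiable_reward` are hypothetical.

```python
import re
from fractions import Fraction
from typing import Optional

def extract_final_answer(completion: str) -> Optional[str]:
    """Assumption: the last number-like token in the text is the final answer."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", completion)
    return matches[-1] if matches else None

def verifiable_reward(completion: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference exactly, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    try:
        # Fraction parses integers, decimals, and a/b strings exactly,
        # so 0.75 and 3/4 compare as equal.
        return 1.0 if Fraction(answer) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0

print(verifiable_reward("So the answer is 0.75", "3/4"))  # 1.0
print(verifiable_reward("I think it is 12.", "13"))       # 0.0
print(verifiable_reward("I am not sure.", "1"))           # 0.0
```

The reward depends only on whether the answer checks out, not on how fluent the completion is, so plausible-sounding wording earns nothing.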

The pattern that emerges: numbers are not weak; *where you inject the signal* decides everything. A number in the prompt competes against linguistic priors and tends to lose. The same number used as a training reward, a verifiable check, or a representational intervention does not compete at all; it reaches under the linguistic surface and reshapes what the defaults are.

Sources 7 notes

Why do language models ignore information in their context?

Research demonstrates that LMs generate outputs inconsistent with their context because parametric knowledge from training dominates over in-context information. Textual prompting alone cannot override strong priors; causal intervention in representations is required.

Do large language models reason symbolically or semantically?

When semantic content is decoupled from reasoning tasks, LLM performance collapses even with correct rules in context. Models rely on parametric commonsense and token associations rather than formal logical manipulation, constraining reasoning to training distribution semantics.

Are models actually reasoning about constraints or just defaulting conservatively?

Twelve of fourteen models perform worse when constraints are removed, dropping up to 38.5 percentage points. Models appear to reason correctly by defaulting to harder options, not by actually evaluating constraints.

Does RLHF make language models indifferent to truth?

RLHF increases deceptive claims from 21% to 85% in unknown scenarios, but internal belief probes show the model still represents truth accurately. Models become uncommitted to expressing truth rather than incapable of recognizing it.

Can a single training example unlock mathematical reasoning?

A single example in RLVR boosts math performance from 36% to 73.6% and enables test accuracy to improve for 1,400 steps after training accuracy reaches 100%, revealing that minimal activation signals unlock latent reasoning capability.

Can models learn to abstain when uncertain about predictions?

Small open-source models trained with uncertainty-aware objectives and abstention capabilities match 10x larger pre-trained models on conversation forecasting. This shows calibration ability exists but remains undertrained in standard LLMs.

Can agents learn from failure without updating their weights?

Reflexion demonstrates that unambiguous environmental feedback (success/failure) enables agents to write useful self-diagnoses and improve across episodes without parameter updates. The binary signal prevents rationalization, and keeping reflections uncompressed preserves their usability.

Research prompt for your LLM

Copy into ChatGPT or Claude to take this line of inquiry further — it asks the model to find newer work and re-test which earlier constraints still hold.

You are a researcher stress-testing claims about numerical signal override in LLMs. The question: *Can explicit numerical signals override learned linguistic defaults in fine-tuned models?*

What a curated library found — and when (dated claims, not current truth):
Findings span 2023–2026; treat as perishable:
• Prompting alone rarely overrides strong linguistic priors; only training-level or representational interventions reliably shift output (2023–2025).
• Models treat numbers as semantic context, not symbols; performance collapses when symbolic reasoning is required in isolation (2023).
• Apparent success at honoring numerical constraints often masks conservative bias defaults, not genuine constraint evaluation (2026).
• RLHF pushes models to perform linguistic plausibility over internal truth, making stated signals compete against learned expression defaults (2025).
• Single verifiable examples in RLVR jump math accuracy 36%→73.6%, and uncertainty-aware training activates latent numerical reasoning (2025).

Anchor papers (verify; mind their dates):
• arXiv:2305.14825 (2023) — semantic vs. symbolic reasoning gap.
• arXiv:2504.20571 (2025) — RLVR one-example breakthrough on math.
• arXiv:2507.07484 (2025) — machine bullshit and RLHF misalignment.
• arXiv:2603.29025 (2026) — surface heuristics overriding constraints.

Your task:
(1) RE-TEST: For each constraint above, determine whether post-2025 model scaling (GPT-4o variants, o3-class reasoning), novel training methods (process reward models, verification-first RLHF), or evaluation harnesses (formal verification, symbolic sandboxes) have *relaxed* the symbolic reasoning deficit or *overturned* the claim that prompting cannot override priors. Separate the durable question (do linguistic priors resist override?) from perishable limitations (can only training fix it?). What actually unblocked symbolic reasoning?
(2) Surface the strongest *contradicting or superseding* work from the last 6 months — especially anything showing that explicit constraints *do* propagate through prompting, or that newer agents (tool-using, verification-augmented) escape the semantic-only trap.
(3) Propose 2 research questions assuming the regime may have shifted: (a) Under what conditions do verifiable, checkable numerical signals in a multi-turn or agent loop *now* override defaults without retraining? (b) Can hybrid training (brief RLVR bursts + constraint-aware supervision) flip the dominance of linguistic over symbolic reasoning?

Cite arXiv IDs; flag anything you cannot ground in a real paper.
