Why does contextual judgment matter more in law and medicine than in mathematics?
This note explores why some domains lean on contextual knowledge and human judgment while math leans on portable reasoning, and what the corpus reveals about that split inside AI systems.
The question points at a real divide the corpus keeps circling: math is a reasoning-dominant domain, while law and medicine are knowledge-dominant ones where the right answer depends on facts, context, and authority that can't be derived from first principles. The clearest evidence comes from work showing that medical accuracy correlates more strongly with whether a model *knows* the right thing than with how well it reasons, while mathematical performance shows the inverse: better reasoning, better answers (see "Does medical AI need knowledge or reasoning more?"). This is why training a model to reason harder helps it on math but can actively *degrade* it on medicine: adjusting the higher network layers that handle reasoning can disturb the lower layers where factual knowledge is retrieved (see "Why does reasoning training help math but hurt medical tasks?").
The deeper reason contextual judgment matters in law and medicine is that reasoning skill doesn't transfer the way you'd hope. A model distilled to be a strong mathematical reasoner fails to beat a plain base model on medical tasks, because no amount of clean inference closes a gap that is really about missing domain-specific knowledge (see "Why doesn't mathematical reasoning transfer to medicine?"). Math is self-contained: the chain of steps validates itself. Medicine and law are not: the correct move depends on particulars the reasoner has to already hold, and on judgment about which facts apply here, now, to this case.
What makes this more than a training-data story is *how* contested-domain expertise actually gets settled. In human practice, law and medicine resolve hard questions through argument quality, social authority, cultural context, and interpersonal trust, not through probability. Multi-agent LLM debates instead settle disagreements by ranking chain-of-thought likelihoods, and in exactly the contested domains where human judgment matters most, that mismatch amplifies errors rather than correcting them (see "How do LLM debates differ from human expert consensus?"). The thing math doesn't need, a social, contextual arbiter, is the thing law and medicine run on.
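The likelihood-ranking mechanism described above can be sketched in a few lines of Python. This is a minimal illustration and not the method of any particular debate system: the `Proposal` type, the length-normalised scoring rule, and the log-probability values are all hypothetical, chosen only to show that the selection step never looks at argument quality or authority.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    agent: str                    # hypothetical agent label
    answer: str                   # the conclusion the chain of thought reaches
    token_logprobs: List[float]   # made-up per-token log-probabilities of the chain


def chain_score(p: Proposal) -> float:
    """Length-normalised log-likelihood of the chain (an assumed scoring rule)."""
    return sum(p.token_logprobs) / len(p.token_logprobs)


def settle_by_likelihood(proposals: List[Proposal]) -> Proposal:
    """Pick the proposal whose chain of thought scores as most probable."""
    return max(proposals, key=chain_score)


# Toy inputs: agent_a's chain is fluent and confident, agent_b's is hesitant.
# Nothing here records which answer is actually correct.
proposals = [
    Proposal("agent_a", "treatment X", [-0.2, -0.1, -0.3, -0.2]),
    Proposal("agent_b", "treatment Y", [-0.9, -1.1, -0.7, -1.3]),
]

winner = settle_by_likelihood(proposals)
print(winner.agent, winner.answer)
```

In this sketch the more fluent chain wins regardless of whether its answer is right, which is the mismatch the paragraph describes: a confident but mistaken chain outranks a hesitant but correct one, and there is no step where quality of argument, standing, or context can overrule it.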
There's a further layer worth knowing: even where models look socially competent, they master the statistics of norms while missing actual participation and culturally resonant interpretation (see "Why do AI systems fail at social and cultural interpretation?"). Contextual judgment isn't just "more facts"; it's situated meaning-making, the capacity to read what a situation calls for. And reasoning itself may be more about *form* than genuine inference: illogical chain-of-thought exemplars perform nearly as well as valid ones, suggesting models learn the shape of reasoning rather than the substance (see "Does logical validity actually drive chain-of-thought gains?"). In math, that imitation of form is often enough to land the answer. In law and medicine, form without the situated knowledge and the authority to judge is precisely where it breaks.
The quietly surprising takeaway: the math-vs-medicine gap isn't about difficulty. Medicine isn't "harder reasoning" — it's a different *kind* of competence, one where knowing and judging-in-context outrank deriving, and where the engine that makes AI good at math is the same engine that can make it worse at the things humans most want judgment for.
Sources (6 notes)
The KI/InfoGain framework reveals that medical domain accuracy correlates more strongly with knowledge correctness than reasoning quality, while mathematical domains show the inverse pattern. This distinction has direct implications for which training strategies to prioritize in each domain.
A two-phase inference model shows that knowledge retrieval operates in lower network layers while reasoning adjustment happens in higher layers. This separation explains why reasoning training improves math but can degrade knowledge-intensive domains like medicine.
R1-distilled reasoning models fail to outperform base models on medical tasks because knowledge accuracy matters more than reasoning quality in medicine—the opposite of math. Fine-tuning cannot close this gap without domain-specific training data.
Multi-agent LLM debates operate through chain-of-thought probability ranking, which is fundamentally different from human debates, which are settled by argument quality, social authority, cultural context, and interpersonal trust. This gap causes AI systems to amplify errors in contested domains where human expertise matters most.
LLMs achieve 100th-percentile performance on norm prediction yet regress on theory-of-mind tasks and cannot generate culturally resonant interpretations. The pattern shows that statistical competence coexists with the absence of actual social understanding and participation.
Illogical chain-of-thought exemplars matched valid CoT performance on BIG-Bench Hard, showing that structural properties—not logical validity—drive the gains. The model learns the form of reasoning, not genuine inference.