Domain Specialization in LLMs

Why do language models fail at temporal reasoning in complex tasks?

Language models correctly answer simple temporal questions but produce logically impossible timelines in complex legal documents. Which task features trigger these reasoning failures, and is the underlying competence genuinely lost or merely masked by surface-level patterns?

Does medical AI need knowledge or reasoning more?

Medical and mathematical domains may require fundamentally different AI training priorities. If medical accuracy depends primarily on factual knowledge while math depends on reasoning quality, should we build and evaluate these systems differently?

Does model access level determine which specialization techniques work?

Different specialization approaches require different levels of access to a model's internals. Understanding this constraint helps practitioners choose realistic techniques for their domain adaptation goals.

Why doesn't mathematical reasoning transfer to medicine?

Can models trained to reason well about math apply those skills to medical domains through fine-tuning? At issue is whether reasoning ability is truly domain-agnostic or constrained by domain-specific knowledge requirements.

When do graph databases outperform vector embeddings for retrieval?

Vector similarity struggles with aggregate and relational queries that require traversing multiple entity connections. Can graph-oriented databases with deterministic queries solve this failure mode in enterprise domain applications?
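
The contrast can be made concrete with a toy example. The sketch below is only an illustration under stated assumptions: the triples, relation names (SUPPLIES, SOLD_TO), and helper functions are all hypothetical, and a plain in-memory index stands in for a real graph database. It answers an aggregate, two-hop question ("how many distinct suppliers stand behind the products sold to a customer?") by exact traversal, which returns the complete set rather than a top-k approximation.

```python
from collections import defaultdict

# Hypothetical toy data: (subject, relation, object) triples.
TRIPLES = [
    ("acme", "SUPPLIES", "widget"),
    ("bolt_co", "SUPPLIES", "widget"),
    ("acme", "SUPPLIES", "gear"),
    ("cog_inc", "SUPPLIES", "gear"),
    ("widget", "SOLD_TO", "customer_a"),
    ("gear", "SOLD_TO", "customer_a"),
    ("gear", "SOLD_TO", "customer_b"),
]


def build_index(triples):
    """Index triples by (relation, object) so edges can be walked backwards."""
    incoming = defaultdict(set)
    for subj, rel, obj in triples:
        incoming[(rel, obj)].add(subj)
    return incoming


def suppliers_for_customer(incoming, customer):
    """Two-hop traversal: customer <- SOLD_TO - product <- SUPPLIES - supplier."""
    suppliers = set()
    for product in incoming[("SOLD_TO", customer)]:
        suppliers |= incoming[("SUPPLIES", product)]
    return suppliers


incoming = build_index(TRIPLES)
result_a = suppliers_for_customer(incoming, "customer_a")
result_b = suppliers_for_customer(incoming, "customer_b")
print(len(result_a), sorted(result_a))
print(len(result_b), sorted(result_b))
```

The count here is exact because the traversal enumerates every matching edge. A similarity search that returns a fixed number of nearest chunks has no comparable guarantee that all relevant entities are present, which is the failure mode the question above describes.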

How do knowledge injection methods trade off flexibility and cost?

When and how should domain knowledge enter an AI system? Four injection paradigms trade off speed, training cost, and adaptability in different ways, so each suits different deployment constraints.

How often do legal AI tools actually hallucinate citations?

Legal vendors claim their AI research tools eliminate hallucinations, but do they? This preregistered study measures hallucination rates in leading commercial legal-research systems to test those marketing claims.

Why do language models struggle with historical legal cases?

Explores whether recency bias in LLMs' training data causes systematic performance degradation on older cases, and what this reveals about how models represent temporal information in specialized domains.

Why do specialized models fail outside their domain?

Deep domain optimization creates sharp performance cliffs at domain boundaries. Specialized models generate plausible-sounding but ungrounded responses when queries fall outside their training scope, and often fail to signal their own ignorance.

Can prompt optimization teach models knowledge they lack?

Explores whether sophisticated prompting techniques can inject new domain knowledge into language models or can only activate knowledge already acquired during training.

Can simple rewards alone teach complex domain reasoning?

Does reinforcement learning on difficult problems with basic accuracy rewards produce sophisticated reasoning strategies without explicit chain-of-thought training? This challenges assumptions about what training signals domain-specific models need in order to learn effectively.
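
To show how little a "basic accuracy reward" has to contain, here is a minimal sketch. It assumes, purely for illustration, that completions end with a line of the form "Answer: <value>". That format and the function names are hypothetical and not taken from any particular training setup. The reward inspects only the final answer and gives no credit for the reasoning that precedes it.

```python
import re


def extract_final_answer(completion: str):
    """Return the text after the last 'Answer:' marker, or None if absent."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def accuracy_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 for a case-insensitive exact match, else 0.0."""
    predicted = extract_final_answer(completion)
    if predicted is None:
        return 0.0
    return 1.0 if predicted.lower() == reference.strip().lower() else 0.0


sample = "The symptoms point to an infection rather than an allergy.\nAnswer: B"
print(accuracy_reward(sample, "b"))
print(accuracy_reward("I am not sure.", "b"))
```

Because the signal is a single scalar per completion, any reasoning strategy that emerges has to be discovered by the policy itself, which is what makes the question above non-trivial.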

Does RL improve domain reasoning by adding knowledge or removing it?

When reinforcement learning improves reasoning in specialized domains like medicine, is it teaching models new facts or preventing them from using wrong ones? Understanding this distinction matters for how we design RL training.

Does supervised fine-tuning actually improve reasoning quality?

While SFT boosts final-answer accuracy, does it degrade the quality and informativeness of the reasoning steps that justify those answers? This matters for high-stakes domains requiring auditable decision-making.

Can organizing knowledge structures beat raw training data volume?

Does structuring domain knowledge into taxonomies during training enable models to learn more efficiently than simply increasing the amount of training data? This challenges assumptions about scaling knowledge injection.

Does supervised fine-tuning improve reasoning or just answers?

Explores whether training models on question-answer pairs actually strengthens their reasoning quality or merely optimizes them toward correct outputs through shortcuts. This matters for deploying AI in domains like medicine where reasoning must be auditable.

Can asynchronous expert training beat synchronized distributed LLM training?

Can training domain-specialized LLM copies in parallel without synchronization, then merging their components into a routed mixture, achieve better efficiency and accuracy than keeping all copies synchronized?
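
The "routed mixture" step can be sketched in a few lines. This is an illustration under loud assumptions, not the method of any specific system: each expert is a stand-in function representing a component taken from one independently trained copy, and the gate weights are set by hand, whereas a real router would normally be learned after merging. All names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(scale):
    """Stand-in for a block taken from one independently trained copy."""
    return lambda x: [scale * v for v in x]


def route(x, experts, gate_weights, top_k=1):
    """Mix the top_k experts selected by a linear gate over the input."""
    scores = [sum(w * v for w, v in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    out = [0.0] * len(x)
    for i in chosen:
        y = experts[i](x)
        out = [o + (probs[i] / norm) * v for o, v in zip(out, y)]
    return out, chosen


# Two hypothetical domain experts and a hand-set gate (one row per expert).
experts = [make_expert(2.0), make_expert(-1.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]
out, chosen = route([3.0, 1.0], experts, gate_weights, top_k=1)
print(chosen, out)
```

With top_k=1 each input is handled by a single expert, so the experts never need to agree on shared parameters during training. Only the gate has to be fitted once the copies are combined.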